[03:19:26] Hi all :) mmm getting some 500 internal server error response codes from Druid, both when querying from Jupyter and also through Pivot...
[03:20:24] Oh hmm seems to work now!
[03:55:31] Hmmm only on some queries it seems, lemme see......
[08:58:47] hello people!
[08:59:03] rolling restart of the remaining yarn nodemanagers to pick up the new settings
[08:59:09] 1 every 2 minutes
[09:37:38] Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276#3843988 (elukey) @EBernhardson the config should be now live everywhere, I keep checking metrics in https://grafana.wik...
[10:13:02] Analytics-Kanban, DBA, User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844059 (elukey)
[10:30:58] joal: o/
[10:31:20] whenever you have time can we chat/brainstorm about what we'd need to test java 8 on hadoop?
[11:37:18] * elukey lunch + errand!
[12:11:48] Analytics, EventBus, Reading-Infrastructure-Team-Backlog, Trending-Service, and 2 others: Trending Edit's worker offsets disappear from Kafka - https://phabricator.wikimedia.org/T181346#3844372 (mobrovac) Open>declined The service has been retired from production, so closing.
[12:12:04] Analytics, EventBus, Reading-Infrastructure-Team-Backlog, Trending-Service, and 2 others: Trending Edit's worker offsets disappear from Kafka - https://phabricator.wikimedia.org/T181346#3844378 (mobrovac)
[13:02:30] elukey, can you ping me when you're back :]
[13:05:24] Hi team - I have some time now that Naé is sleeping, but I'll take today off - She's sick :(
[13:05:40] elukey: if you're back we can brainstorm :)
[13:14:51] joal: bit busy now, we can chat tomorrow, nothing urgent..
if I may add a container to your brain scheduler: try to think about what do you need in labs to test hadoop/java8 and I'll try to make it happen :)
[13:15:03] mforns: o/
[13:15:27] elukey: Starting my christmas wishlist ;)
[13:16:36] :D
[13:19:16] elukey, hey!
[13:20:03] I'm trying to access puppet logs for hdfs ensures
[13:20:09] but no permits
[13:20:44] I think hdfs: /var/log/refinery is not successfully ensured
[13:23:34] mforns: only root can do it
[13:23:42] lemme check in puppet
[13:24:01] elukey, am I right to think that /var/log/refinery should exist?
[13:24:28] and that its absence is the cause of: /bin/sh: 1: cannot create /var/log/refinery/drop-mediawiki-history.log: Permission denied
[13:24:34] ?
[13:24:44] ah snap I forgot to follow up on that one!
[13:25:07] did it realarm?
[13:25:28] last one for me is the 15th of dec
[13:25:30] dunno, I was looking into it because the alarm from friday
[13:25:31] yes
[13:25:39] ok, you already solved it?
[13:25:58] nono but it was far down in my todo list, lemme check it now
[13:26:00] it seems important
[13:26:29] so
[13:26:30] elukey@analytics1003:/var/log$ ls -dl refinery/
[13:26:30] drwxrwsr-x 2 hdfs analytics-admins 4096 Dec 4 06:25 refinery/
[13:26:38] elukey, aaaaaaaa
[13:27:08] I can not see that...
[13:27:25] but
[13:27:26] elukey@analytics1003:/var/log$ ls -l /var/log/refinery/drop-mediawiki-history.log
[13:27:29] -rw-r--r-- 1 root analytics-admins 394 Nov 15 00:00 /var/log/refinery/drop-mediawiki-history.log
[13:27:37] root!
[13:27:50] and group has only read
[13:28:57] elukey, the ensure command specifies user=hdfs and group=analytics-admins
[13:29:18] (for refinery)
[13:29:35] elukey, maybe........
[13:29:39] mforns: we usually ensure only directories, not files
[13:29:59] yes yes, I was talking about the directory
[13:30:00] so each process is free to create them whenever they want
[13:30:07] elukey, I think I know what happened
[13:30:45] mediawiki-snapshot-cleaner was executing as root before, then we got a ton of errors, and we changed it to hdfs
[13:32:02] this one is probably the issue
[13:32:03] /var/log/refinery/*.log { size 100M rotate 4 missingok notifempty nocreate su root hdfs
[13:32:07] }
[13:32:09] horrible paste
[13:35:43] hey elukey, sorry I got an internet hiccup
[13:36:00] I was saying that I think we can delete that log file, and then let the cleaner recreate it under hdfs
[13:37:06] sure, I still have no idea why the logrotate rule for the dir is root/hdfs
[13:37:09] probably wrong
[13:38:58] (PS1) GoranSMilovanovic: minor [analytics/wmde/WDCM-Overview-Dashboard] - https://gerrit.wikimedia.org/r/398833
[13:39:01] anyhow, I have chowned the file to hdfs
[13:39:17] elukey, ok, thanks! this should do the trick
[13:39:21] ah it runs only the 15th of the month!
[13:39:25] yes
[13:39:31] mforns: are you going to re-launch it?
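Note on the ownership problem above: the log file had been created by root while the job now runs as hdfs, so the job could no longer write to it. A minimal sketch of a check that lists files in a log directory not owned by the expected user. `find_misowned` is a hypothetical helper for illustration, not part of refinery or puppet, and it assumes a Unix host where file ownership is a numeric uid.

```python
import os
import tempfile


def find_misowned(directory, expected_uid):
    """Return sorted paths of regular files in `directory` not owned by `expected_uid`."""
    misowned = []
    for entry in os.scandir(directory):
        # Only look at regular files; directories are managed separately.
        if entry.is_file(follow_symlinks=False):
            if entry.stat(follow_symlinks=False).st_uid != expected_uid:
                misowned.append(entry.path)
    return sorted(misowned)


if __name__ == "__main__":
    # Self-contained demo on a temporary directory (not the real /var/log/refinery).
    with tempfile.TemporaryDirectory() as log_dir:
        open(os.path.join(log_dir, "drop-mediawiki-history.log"), "w").close()
        # The file was just created by the current user, so nothing is reported.
        print(find_misowned(log_dir, os.getuid()))
```

On a real host one would pass the uid of the job's user (for example via `pwd.getpwnam`), which is left out here to keep the sketch independent of any particular account.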
[13:39:47] no, no, it was not a stoppage I think
[13:39:51] oh, wait
[13:39:52] (PS1) GoranSMilovanovic: minor [analytics/wmde/WDCM-Usage-Dashboard] - https://gerrit.wikimedia.org/r/398834
[13:40:24] (PS1) GoranSMilovanovic: minor [analytics/wmde/WDCM-Semantics-Dashboard] - https://gerrit.wikimedia.org/r/398835
[13:40:41] (CR) GoranSMilovanovic: [C: 2] minor [analytics/wmde/WDCM-Overview-Dashboard] - https://gerrit.wikimedia.org/r/398833 (owner: GoranSMilovanovic)
[13:40:45] Analytics, DBA, Patch-For-Review, User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3844569 (jcrespo)
[13:40:48] Analytics-Kanban, DBA, User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844568 (jcrespo)
[13:40:49] (CR) GoranSMilovanovic: [C: 2] minor [analytics/wmde/WDCM-Usage-Dashboard] - https://gerrit.wikimedia.org/r/398834 (owner: GoranSMilovanovic)
[13:40:56] Analytics-Kanban, DBA, User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844059 (jcrespo) a:jcrespo
[13:40:58] (CR) GoranSMilovanovic: [C: 2] minor [analytics/wmde/WDCM-Semantics-Dashboard] - https://gerrit.wikimedia.org/r/398835 (owner: GoranSMilovanovic)
[13:41:04] elukey, I'd say it was only a problem of writing the log, let me check
[13:41:33] okok
[13:41:41] mforns: totally unrelated thing - https://phabricator.wikimedia.org/T183123#
[13:41:50] lookin
[13:43:22] elukey, looks good to me, both options
[13:43:32] paranoia mode on :D
[13:43:47] elukey, wouldn't it be possible to restore data from slave in case of failure?
[13:44:03] oh, you mean in case there's a bug in both hosts?
[13:44:09] yeah
[13:44:12] ok
[13:44:27] yea, we can keep the dump for a couple of weeks
[13:46:34] mforns: other thought - would it be more consistent if we kept the same purging scheme on both db1108/db1107 ?
[13:46:57] elukey, hm
[13:47:22] elukey, would it be possible performance-wise?
[13:47:33] I'd say so yes
[13:47:48] executing batch updates while inserting?
[13:48:18] well it happens even for the slave since the eventlogging_sync periodically does those in batch
[13:48:37] elukey, I see
[13:48:37] and drops would also impact the mysql db performance wise
[13:49:44] I'm not experienced at all in this, so yea, if it makes no difference to performance, I'd say yes, let's keep both master and slave the same :]
[13:49:46] better
[13:51:28] we can test it and see the performance impact
[13:51:36] elukey, the mediawiki snapshot cleaner failed indeed, will relaunch it
[13:51:40] super
[13:51:48] remember to use a screen/tmux session
[13:51:54] so you'll be able to detach it
[13:52:12] sure
[14:01:16] mforns: as FYI, jaime started a mysqldump on db1107 just now that may be a bit aggressive
[14:01:31] can you help me whatch el metrics to see if anything slows down too much?
[14:01:37] elukey, sure
[14:01:52] will look at the mysql_consumer logs
[14:02:20] Analytics-Kanban, DBA, Patch-For-Review, User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844654 (jcrespo) Backup is ongoing on db1107:/srv/backups/export-20171218-135659 kill the myd...
[14:02:35] Analytics-Kanban, DBA, User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3844655 (jcrespo)
[14:04:26] elukey, the log seems parallized
[14:04:46] since 7 minutes ago
[14:05:40] mforns: let's stop the mysql consumer then
[14:05:59] elukey, yes, do you know how to stop it independently?
[14:06:51] never done it.. eventloggingctl something stop? :D
[14:07:53] stop eventlogging/consumer NAME=mysql-m4-master-00 CONFIG=/etc/eventlogging.d/consumers/mysql-m4-master-00 ?
[14:07:56] elukey, I don't know how to stop the consumer only: sudo eventloggingctl stop stops everything, but I'm not sure how to partilly stop it
[14:08:00] lookin
[14:08:13] yea, that makes sense
[14:08:32] done
[14:09:38] elukey, try: sudo eventloggingctl status
[14:09:51] it says consumer mysql-eventbus start/running 7794
[14:10:07] yeah, should it also list the other mysql as stopped?
[14:10:14] yes?
[14:11:14] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 consumer/mysql-eventbus
[14:11:50] nice :D
[14:12:01] so the eventlogctl doesn't tell if something is stopped
[14:12:10] it simply doesn't list it
[14:13:06] ok
[14:13:45] yall doin eventlogging mysql things?
[14:13:48] elukey, did you also stop the mysql-eventbus?
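Note on the status output above: since the status command lists only running jobs, a stopped consumer shows up only as a missing line. A minimal sketch of making that gap explicit by diffing the defined jobs against the status output. The status line format is assumed from the single line quoted in the chat (`consumer mysql-eventbus start/running 7794`), and `running_jobs` / `stopped_jobs` are hypothetical helpers, not the actual monitoring check.

```python
def running_jobs(status_output):
    """Parse lines like 'consumer mysql-eventbus start/running 7794'
    into names like 'consumer/mysql-eventbus'."""
    jobs = set()
    for line in status_output.splitlines():
        parts = line.split()
        # Assumed layout: <type> <name> <state> <pid>
        if len(parts) >= 3 and parts[2].startswith("start/running"):
            jobs.add(f"{parts[0]}/{parts[1]}")
    return jobs


def stopped_jobs(defined, status_output):
    """Return defined jobs that do not appear as running, sorted by name."""
    return sorted(set(defined) - running_jobs(status_output))


if __name__ == "__main__":
    defined = ["consumer/mysql-m4-master-00", "consumer/mysql-eventbus"]
    status = "consumer mysql-eventbus start/running 7794"
    print(stopped_jobs(defined, status))
```

The list of defined jobs would have to come from somewhere independent of the status command, for example the config files under /etc/eventlogging.d/ mentioned in the chat; how the real check obtains it is not shown in this log.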
[14:13:53] yep
[14:14:05] ottomata: yesss sorry
[14:14:18] s'ok just saw page and checkin :)
[14:15:07] ottomata: yes sorry, jaime is taking a mysql backup of db1107 and we stopped the mysql consumers
[14:18:11] (CR) Ottomata: [C: 1] Correct bug in mediawiki raw revision table [analytics/refinery] - https://gerrit.wikimedia.org/r/398552 (owner: Joal)
[14:18:29] ottomata: whenever you have time I have a list of things that happened for kafka
[14:18:39] didn't have time to write an email
[14:18:47] but I can do it if you want to read it later on
[14:19:59] that happened?
[14:20:07] hey all
[14:21:18] (CR) Milimetric: Correct bug in mediawiki raw revision table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/398552 (owner: Joal)
[14:21:19] ottomata: yeah, kafka1023 ready to be among the partition leaders again, vk on cp1008 pushing via TLS to jumbo, chat with brandon, etc..
[14:23:03] heya
[14:27:05] OHH cool
[14:27:48] (CR) Milimetric: Change addblocker text (1 comment) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/398598 (https://phabricator.wikimedia.org/T182958) (owner: Nuria)
[14:27:52] Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276#3844755 (elukey)
[14:28:18] Analytics-Kanban: Refresh SWAP hardware - https://phabricator.wikimedia.org/T183145#3844760 (Aklapper)
[14:33:15] General question to anybody who is on: Several hive tables have either a "project" (en.wikipedia) or a "db" (enwiki). Is there a good reason to use one or the other as a canonical identifier?
[14:37:18] Shilad: i don't know the full details, but generally they will correlate, but I think there are some exceptions.
[14:37:30] db is explicitly what it is: the name of the mysql database that the wiki uses
[14:37:47] project is probably more appropriate, as it is explicitly normalized in some way
[14:38:10] Got it.
[14:39:17] ottomata: And I think I go back and forth using mediawiki_project_namespace_map, but stripping ".org" from hostname to get project name.
[14:39:28] ottomata: Does that seem right to you?
[14:40:27] Shilad: i actually don't know, mforns might know more?
[14:41:17] Thanks! Context: I am trying to infer project for search queries and the page_info.project field in webrequest is sometimes unavailable, but host is always there.
[14:51:55] hi Shilad, mediawiki_project_namespace_map is the right way of translating from hostname to dbname an viceversa
[14:51:58] :]
[14:52:18] Thanks, mforns. And I can just strip ".org" from hostname to get projectname?
[14:52:20] regarding page_info.project in webrequest... not sure
[14:52:31] Shilad, yes
[14:52:44] mforns: Thanks!
[14:52:44] that works
[14:52:51] np :]
[15:01:43] Analytics-Kanban, Analytics-Wikistats: [Wikistats2] Add link path to router-link - https://phabricator.wikimedia.org/T183149#3844925 (mforns)
[15:04:14] (PS1) Mforns: Add link path to router-link [analytics/wikistats2] - https://gerrit.wikimedia.org/r/398854 (https://phabricator.wikimedia.org/T183149)
[15:05:10] mforns: lol https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1107&var-network=eth0&from=1513594906656&to=1513605706656
[15:05:30] better: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1107&var-network=eth0&from=now-3h&to=now
[15:07:41] also stopped eventlogging_sync on db1108
[15:19:01] (PS2) Mforns: Add link path to router-link [analytics/wikistats2] - https://gerrit.wikimedia.org/r/398854 (https://phabricator.wikimedia.org/T183149)
[15:26:48] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [analytics/aggregator] -
https://gerrit.wikimedia.org/r/387606 (owner: Hashar)
[15:36:19] mforns: we are also going to upgrade mariadb + kernel on db1107
[15:36:19] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3845141 (Ottomata) > Regarding 1 [...] for the large traffic cross-dc communication to go over verified protocols like TCP/Kafka, as opposed to UDP/Statsd itself. Indeed, but s...
[15:36:21] since now everything is stopped
[15:36:30] elukey, ok
[15:37:21] (PS3) Mforns: Add link path to router-link [analytics/wikistats2] - https://gerrit.wikimedia.org/r/398854 (https://phabricator.wikimedia.org/T183149)
[15:40:14] (PS4) Mforns: Add link path to router-link [analytics/wikistats2] - https://gerrit.wikimedia.org/r/398854 (https://phabricator.wikimedia.org/T183149)
[15:42:02] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3845153 (Ottomata) Or! We could do like we do for all the other replicated topics, and use DC prefixes, and keep the edge DC kafka cluster routing map. This would be like the a...
[15:44:16] elukey, you use a mac right?
[15:47:03] yep
[15:57:10] mforns: all done! We can re-enable
[15:57:19] I am going to re-enabled eventlogging_sync on db1108 first
[15:57:20] elukey, awesome :]
[15:57:25] k
[15:57:26] then the mysql consumers
[16:07:40] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Review the alert message about adblocker preventing AQS requests - https://phabricator.wikimedia.org/T182958#3839781 (Milimetric) @Trizek-WMF I suggested some alternative language here, I think mentioning scripts is likely confusing for users....
[16:17:07] Analytics-Kanban, DBA, User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3845287 (jcrespo) Ready to close when @elukey is ready
[16:20:41] mforns: ready to renable el?
[16:20:48] elukey, yes
[16:21:35] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning.
[16:21:59] elukey, seems to work fine!
[16:22:08] super!
[16:22:09] \o/
[16:22:21] Thanks guys :)
[16:22:39] Analytics-Kanban, DBA, User-Elukey: Precautionary backup needed for the log database on db1107 before applying regular purging/sanitization - https://phabricator.wikimedia.org/T183123#3845300 (elukey) Open>Resolved Everything looks good, thank a lot!
[16:22:45] Analytics, DBA, Patch-For-Review, User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3845302 (elukey)
[16:23:30] hi all! just a heads-up, Druid seems to be stumbling on some groupby queries (Pivot splits) maybe from overload or something?
For example pageviews hourly, latest 1 day, try grouping/splitting on project or Ua Device Family
[16:24:03] Getting 500 internal server errors in the background call from Pivot, and also an error when running the same queries via pydruid from Jupyter
[16:24:19] Hi AndyRussG - groupby is known to be slow in Druid
[16:24:20] An
[16:24:21] Not a blocker for me for anything, just thought I'd mention, in case it's not expected ;)
[16:24:34] joal: ah hmm ok
[16:24:39] AndyRussG: If you can, use topN and then select the few ones you're interested in
[16:24:45] It'll be fast
[16:24:49] Analytics, Pageviews-API: Wikistats Bug : wrong data in Top viewed articles (about frwiki) - https://phabricator.wikimedia.org/T182954#3845317 (fdans)
[16:25:04] very nice that https://grafana.wikimedia.org/dashboard/db/prometheus-druid?orgId=1 shows those slow queries joal !
[16:25:26] joal: ah ok gotcha... yeah these are columns that probably have a pretty big range.... thanks much!!! :)
[16:25:33] Hooooo :) Awesome elukey :)
[16:26:11] np AndyRussG - This method (topN + multiple timeseries requests) is the strategy used by piv :)
[16:26:17] pivot sorry
[16:27:14] hmmm
[16:28:53] joal: you mean maybe to list the available values on the drop-downs in the Filter area? Just dragging a field into the split area for the 1-day period causes a "timeout" message in the UI, though in the network tab it's a 500
[16:29:39] AndyRussG: Can you send me a link wioth that query?
[16:29:44] Analytics, Analytics-Wikistats: [Wikistats2] The detail page for tops metrics does not indicate time range - https://phabricator.wikimedia.org/T182990#3845327 (fdans)
[16:29:44] yurp!
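Note on the topN suggestion above: a minimal sketch of what a Druid native topN query body looks like, built as a plain Python dict so it can be inspected without a cluster. The datasource name `pageviews-hourly` is taken from the broker log quoted later in this channel; the dimension `project`, the metric column `view_count`, the interval, and the helper `build_topn_query` are assumptions for illustration only.

```python
import json


def build_topn_query(datasource, dimension, metric_field, interval, threshold=10):
    """Build a Druid native topN query body as a plain dict."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": dimension,
        # Rank dimension values by the aggregator named "views" below.
        "metric": "views",
        "threshold": threshold,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": "views", "fieldName": metric_field}
        ],
    }


if __name__ == "__main__":
    query = build_topn_query(
        datasource="pageviews-hourly",  # name as seen in the broker log
        dimension="project",            # assumed dimension name
        metric_field="view_count",      # assumed metric column
        interval="2017-12-17T00:00:00Z/2017-12-18T00:00:00Z",
    )
    print(json.dumps(query, indent=2))
```

The resulting JSON would be POSTed to a broker's native query endpoint (`/druid/v2/`). The follow-up step joal describes, one timeseries query per value returned by topN, is not shown here.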
[16:29:46] Analytics, Pageviews-API: Wikistats Bug : wrong data in Top viewed articles (about frwiki) - https://phabricator.wikimedia.org/T182954#3845329 (fdans)
[16:31:38] Analytics, Pageviews-API: Filter top pages by namespace/category - https://phabricator.wikimedia.org/T182975#3845335 (jberkel)
[16:31:53] joal: https://goo.gl/FxbyQb
[16:32:05] again, not a blocker for any stuff I'm working on just now...
[16:32:40] Analytics, Analytics-Wikimetrics: Layout bug in Safari for detail page - https://phabricator.wikimedia.org/T182821#3845336 (fdans) Working good on Safari 11!
[16:32:57] AndyRussG: I think it's because the UA-Device-Family is big - that's weird though, this should work
[16:33:33] Analytics, Analytics-Wikimetrics: Layout bug in Safari for detail page - https://phabricator.wikimedia.org/T182821#3845338 (fdans) Open>Resolved a:fdans
[16:35:15] Analytics-Kanban, Analytics-Wikistats: wikistats rendering bug - https://phabricator.wikimedia.org/T182817#3845342 (fdans) a:Milimetric
[16:36:50] Analytics: Whitelist analytics.wikimedia.org and stats.wikimedia.org in add blockers - https://phabricator.wikimedia.org/T182816#3835518 (fdans) We've tried this before, we'll try again!
[16:36:56] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3845346 (Krinkle) >>! In T179093#3845141, @Ottomata wrote: >> Regarding 1 [...] for the large traffic cross-dc communication to go over verified protocols like TCP/Kafka, as oppo...
[16:37:10] joal: yeah... Same happens on the project column btw, but they both work if you trim it down to a 1-hour window
[16:37:20] Analytics, Analytics-Wikimetrics: selecting wikisource.org?
- https://phabricator.wikimedia.org/T183154#3845347 (Nuria)
[16:37:53] tried only groupby in Jupyter, I'll check out the topn option :)
[16:37:53] Analytics-Kanban: Remove request for font.googleapis.com from analytics.wikimedia.org - https://phabricator.wikimedia.org/T182804#3845360 (fdans) a:Milimetric
[16:38:18] Analytics: Remove request for font.googleapis.com from analytics.wikimedia.org - https://phabricator.wikimedia.org/T182804#3845362 (Milimetric)
[16:38:44] yeah did seem like something that users might wish to do and expect to work, not too infrequently
[16:38:53] Analytics, Analytics-Wikistats: Sort "Top Viewed Articles" by views, rather than alphabetically - https://phabricator.wikimedia.org/T182757#3845377 (fdans) Open>Resolved a:fdans
[16:40:42] Thanks for pinging us AndyRussG - It's unexpected and probably not "groupby" related, but we'll look into it !
[16:42:28] elukey: I think we have something with Druid
[16:42:58] Analytics, Analytics-Wikimetrics: selecting wikisource.org? - https://phabricator.wikimedia.org/T183154#3845347 (fdans) You can type Sources -> wikisource.org, but you own't find it by default because there's a result limit. Let's not have a limit on the wikiselector for families.
[16:43:49] Analytics, Analytics-Wikistats: When searching for a project language, display a full list of languages - https://phabricator.wikimedia.org/T182960#3845388 (fdans)
[16:43:51] Analytics, Analytics-Wikimetrics: selecting wikisource.org? - https://phabricator.wikimedia.org/T183154#3845390 (fdans)
[16:44:04] joal: --verbose
[16:44:28] Druid is very slow and seems unable to answer queries that previsouly worked
[16:44:37] Analytics, Analytics-Wikistats: When searching for a project language, display a full list of languages - https://phabricator.wikimedia.org/T182960#3839812 (fdans)
[16:45:07] elukey: --^
[16:45:14] druid analytics right?
[16:45:33] correct elukey
[16:45:37] Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276#3845395 (EBernhardson) Thanks! I'll try this out this week and see how things go.
[16:46:52] elukey: Have we done anything on druid lately?
[16:47:03] joal: one thing to bare in mind is that on analytics druid we have only one broker available for pivot (druid1001)
[16:47:06] Analytics-Cluster, Analytics-Kanban, Operations, hardware-requests: eqiad: (8) Hadoop expansion - FY 2017 / 2018 - https://phabricator.wikimedia.org/T182628#3845408 (Milimetric)
[16:47:41] Funny thing elukey: hole in realtime processed-events betwewen 14:00 and 14:15 :(
[16:47:52] * elukey cries in a corner
[16:48:48] elukey: I wonder if it wouldn't be related to some webrequest dashboard open and refreshing very often
[16:49:20] elukey: possible related to broker as well, I don't know :(
[16:50:11] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3845424 (Ottomata) > Compared to (cross-dc write of kafka, only) Ah yeah, that is this: > Aside from that, in this case, running active/active statsv varnishkafka producers & st...
[16:51:21] joal: can we try the same query that appears to be slow from pivot on druid1002/3 ?
[16:51:25] i am also checking https://grafana.wikimedia.org/dashboard/db/prometheus-druid?orgId=1&from=now-7d&to=now
[16:52:09] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Review the alert message about adblocker preventing AQS requests - https://phabricator.wikimedia.org/T182958#3845427 (Trizek-WMF) >>! In T182958#3845262, @Milimetric wrote: > @Trizek-WMF I suggested some alternative language here, I think menti...
[16:52:10] elukey: I can try queries, just need to find them :)
[16:52:20] elukey: picot is bhorium, right?
[16:52:27] s/picot/pivot
[16:52:35] nope, thorium
[16:52:38] elukey, EL seems fine still, catching up on insertions! There are some duplicate entries... don't know exactly why: it can be because of small kafka offset differences
[16:52:40] Arf
[16:52:47] this has happened before and I think we're fine
[16:53:53] mforns: yeah :(
[16:54:39] elukey: Looks like pivot is scanning druid multiple times per minute :)
[16:54:49] elukey, I guess when we stopped the consumer, because of extreme mysql conditions, the consumer blocked and was not able to sync offsets with kafka
[16:55:05] elukey: however, no druid query in pivot logs - looking for broker logs then
[16:55:12] we might also have lost some buffered data? maybe? but very few I'd say
[16:55:46] mforns: well I'd expect the consumer not to commit anything until it pushes it mysql no?
[16:56:13] elukey, mhhh.... no, because there are biggish buffers
[16:56:23] I think it commits every 2 seconds or so
[16:57:09] mforns: at this point I guess that it uses an error topic or something similar?
[16:57:36] elukey, when?
[16:59:37] I don't think small differences in kafka offsets are a problem, we might have dropped a couple events because of that, but very few, like orders of magntitude below regular event input
[17:00:10] I guess the only problem is EventLogging consumer buffers, when we stopped the consumer, they were lost I guess...
[17:00:37] Analytics-Cluster, Analytics-Kanban, Operations, hardware-requests: eqiad: (8) Hadoop expansion - FY 2017 / 2018 - https://phabricator.wikimedia.org/T182628#3845436 (faidon) p:Triage>High
[17:01:00] mforns: ack!
[17:00:51] I can check in the db for holes when the insertion rate has stabilized
[17:01:24] mforns: I thought that events in the buffer not pushed to mysql would have gone to a kafka topic error
[17:01:47] elukey, oh! that would be a good idea :] does EL do that?
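Note on the duplicates-versus-lost-events discussion above: which of the two happens after an interrupted run depends on whether a consumer commits its offset before or after flushing its buffer to the database. A toy simulation of both orderings follows. It is not EventLogging's actual code; `consume` and all of its parameters are hypothetical, and a Python list stands in for both the Kafka topic and the MySQL table.

```python
def consume(events, start_offset, batch_size, crash_after_batches, commit_first):
    """Simulate one consumer run that may crash while handling a batch.

    Returns (rows_written, committed_offset). Pass crash_after_batches=None
    for a run that never crashes.
    """
    written = []
    committed = start_offset
    offset = start_offset
    batches = 0
    while offset < len(events):
        batch = events[offset:offset + batch_size]
        next_offset = offset + len(batch)
        crashing = batches == crash_after_batches
        if commit_first:
            committed = next_offset
            if crashing:
                return written, committed  # offset moved on, batch never written: lost
            written.extend(batch)
        else:
            written.extend(batch)
            if crashing:
                return written, committed  # batch written, offset not moved: re-read later
            committed = next_offset
        offset = next_offset
        batches += 1
    return written, committed


if __name__ == "__main__":
    events = list(range(6))
    for commit_first in (True, False):
        first, committed = consume(events, 0, 2, 1, commit_first)
        second, _ = consume(events, committed, 2, None, commit_first)
        print("commit_first =", commit_first, "->", first + second)
```

With commit-before-write the interrupted batch is missing after the restart; with write-before-commit it appears twice. Deduplicating on a unique event id at insert time is one common way to make the second ordering safe, though whether EventLogging does so is not stated in this log.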
[17:02:03] joal: thanks likewise :) o/
[17:03:02] elukey: I can't say what, but something has changed lately in Druid
[17:03:46] elukey: a Dashboard I was running with superset doesn't work anymore (timeouts)
[17:04:06] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3845440 (Krinkle) >>! In T179093#3845424, @Ottomata wrote: > More thoughts about ^: in my conversation with @bblack, I had originally wanted to more simply use cache traffic rou...
[17:05:47] milimetric: another question: Do we cancel the meeting with Erik tonight, or are there anything I forget and we should discuss?
[17:10:48] joal: so it should be related to broker/historicals right?
[17:10:50] elukey: timeout errors in druid broker :(
[17:13:52] so query=GroupByQuery{dataSource='pageviews
[17:13:53] -hourly', querySegmentSpec=LegacySegmentSpec{intervals=[2017-12-11T00:00:00.000Z/2017-12-18T17:03:08.000Z]}
[17:13:57] triggered a timeout
[17:14:03] yup
[17:14:15] I'm surprised about the fact it's a groupby
[17:14:34] the fault seems to be org.jboss.netty.handler.timeout.ReadTimeoutException
[17:15:02] and I guess, checking from the latency metrics, that we have 10s no
[17:15:04] *now
[17:15:15] elukey: I looked at data-size (pageview-hourly segments for 2017-12-17+18): less than 4Gb - This should be super easy
[17:18:36] well not if not in cache right ?
[17:19:21] elukey: druid would have to read the segments, but as said - 4G, and the disks are SSDs, timeout doesn't sound right
[17:19:50] elukey: I don't know how, but it seems druid swaps
[17:20:05] elukey: see mem on druid1001
[17:20:11] free sorry
[17:20:38] joal: do you remember that I mentioned the use of a huge heap vs few cached memory for the os ?
[17:20:41] :D
[17:20:45] :D
[17:21:16] elukey: swap used on both druid1001 and druid1003
[17:22:11] elukey: And given the cache-size used from the metrics you sent, we better release most of that memory to the system
[17:23:34] joal: we'd probably need to make some calculations and/or review the Xms/Xmx settings (probably only the former)
[17:23:46] elukey: agreed
[17:24:05] elukey: This still seems weird - things that were working yesterday don't today
[17:24:27] I wonder if there are new regular usage we dont know of
[17:25:26] it could be yes
[17:27:25] it is also weird that I can't find those 10s of read timeout anywhere
[17:30:33] elukey: Ok, I suppose we're gonna talk about that first thing tommorow
[17:31:30] yep!
[17:31:37] k
[17:31:52] Gone for diner a-team, back after
[17:42:52] (PS13) Fdans: Add pageview by country oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[18:18:12] hey all, got another quick question here... Where could I see the code or read some cannonical doc about the contents of the 'project' field in Pageview hourly? https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly
[18:18:25] Perhaps somewhere in the wmf config repo?
[18:19:10] That column seems is also in the pageview_info map in Webrequest https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest
[18:20:27] Just wishing to find specifically the cannonical place we set up which sites would take a language prefix there and which wouldn't
[18:20:54] I gues from an analytics perspective it comes directly from the URL somehow?
[18:21:13] (and then the for the real setup I should check with operaitons)?
[18:29:01] AndyRussG: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L271
[18:29:27] AndyRussG: We compute the project from URLs indeed
[18:30:02] joal: cool, thanks so much!!! :D
[18:30:24] AndyRussG: And the project field in pageview_hourly is the exact same as pageview_info['project'] in webrequest (see https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/pageview_hourly.hql#L34)
[18:31:08] AndyRussG: The thing to keep in mind when using this field in webrequest table is that it is corrextly populated on for rows having is_pageview = TRUE
[18:31:32] A value may exist when is_pageview = false, but it definitely could be wrong
[18:33:45] joal: yeah really looking at pageviews
[18:35:09] AndyRussG: If you're after project-normalization data in webrequest, normalized_host (see https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L245 and https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql#L103)
[18:39:18] joal: nice...! Yeah, webrequest.normalized_host != pageview_hourly.project
[18:39:26] correct AndyRussG :)
[18:41:47] * elukey off!
[19:22:33] elukey: joal: yeah Pivot now borking on simple queries, here's Pageviews Hourly for the last week, no filters or splits: https://goo.gl/W3fNEZ
[19:23:29] AndyRussG: elukey is gone for tonight, we plan to take care of that tomorrow :)
[19:24:26] joal: K, thanks so much... Yea just thought I'd mention in case it's useful.....
:)
[19:27:06] Thanks AndyRussG :)
[21:04:18] Analytics, Analytics-Wikistats: Link to 'more info' doesn't always work - https://phabricator.wikimedia.org/T183188#3846199 (Erik_Zachte)
[21:05:49] Analytics, Analytics-Wikistats: Display of radio buttons in Wikistats 2 is somewhat confusing - https://phabricator.wikimedia.org/T183185#3846212 (Erik_Zachte) a:Milimetric
[21:06:23] Analytics, Analytics-Wikistats: Make the colors used the line charts in Wikistats 2 more easy to recognize. - https://phabricator.wikimedia.org/T183184#3846218 (Erik_Zachte) a:Milimetric
[21:06:55] Analytics, Analytics-Wikistats: Present Wikistats 2 charts for the period selected by the user. - https://phabricator.wikimedia.org/T183183#3846222 (Erik_Zachte) a:Milimetric
[21:08:09] Analytics, Analytics-Wikistats: Consistently preserve settings when a user switches to a new metric (especially on the same page). - https://phabricator.wikimedia.org/T183181#3846228 (Erik_Zachte) a:Milimetric
[21:09:01] Analytics, Analytics-Wikistats: Consistently preserve settings when a user switches to a new metric (especially on the same page). - https://phabricator.wikimedia.org/T183181#3846232 (Erik_Zachte) @Catrope sorry, I added subscribers and project
[21:18:46] Analytics, Analytics-Wikistats: Please add download option 'as csv file' to Wikistats 2 - https://phabricator.wikimedia.org/T183192#3846270 (Erik_Zachte)
[22:20:47] (PS1) Ottomata: Add _REFINE_FAILED failure flag and skip refinement if it exists [analytics/refinery/source] - https://gerrit.wikimedia.org/r/399105
[22:22:02] (PS2) Ottomata: Add _REFINE_FAILED failure flag and skip refinement if it exists [analytics/refinery/source] - https://gerrit.wikimedia.org/r/399105
[23:33:41] Analytics, Analytics-Wikistats: roadmap of migration to Wikistats 2 - https://phabricator.wikimedia.org/T183180#3846772 (Aklapper) Assuming this is about #analytics-wikistats