[01:44:47] 10Analytics, 10Analytics-EventLogging, 10Page-Previews, 10Readers-Web-Backlog, 10Readers-Web-Kanbanana-Board: Some VirtualPageView are too long and fail EventLogging processing - https://phabricator.wikimedia.org/T196904 (10Ryasmeen) Verified on Beta both through Special: AllPages and on regular articles... [01:44:49] 10Analytics, 10Analytics-EventLogging, 10Page-Previews, 10Readers-Web-Backlog, 10Readers-Web-Kanbanana-Board: Some VirtualPageView are too long and fail EventLogging processing - https://phabricator.wikimedia.org/T196904 (10Ryasmeen) Verified on Beta both through Special: AllPages and on regular articles... [04:55:46] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change - https://phabricator.wikimedia.org/T198906 (10Ottomata) [04:55:49] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change - https://phabricator.wikimedia.org/T198906 (10Ottomata) [05:35:25] morning! [05:44:23] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Deploy refinery to eventlogging hosts - https://phabricator.wikimedia.org/T198766 (10elukey) a:03elukey [05:44:31] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Deploy refinery to eventlogging hosts - https://phabricator.wikimedia.org/T198766 (10elukey) a:03elukey [05:45:00] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Deploy refinery to eventlogging hosts - https://phabricator.wikimedia.org/T198766 (10elukey) We could also think about moving the Python script (currently in python) to the refinery, and reference it from puppet? [05:45:09] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Deploy refinery to eventlogging hosts - https://phabricator.wikimedia.org/T198766 (10elukey) We could also think about moving the Python script (currently in python) to the refinery, and reference it from puppet? 
[05:47:24] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Deploy refinery to eventlogging hosts - https://phabricator.wikimedia.org/T198766 (10elukey) @Marostegui, @jcrespo: we'd need to deploy the [[ https://github.com/wikimedia/analytics-refinery/blob/master/bin/sqoop-mediawiki-tables | Analytics Refinery ]] reposit... [05:47:27] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Deploy refinery to eventlogging hosts - https://phabricator.wikimedia.org/T198766 (10elukey) @Marostegui, @jcrespo: we'd need to deploy the [[ https://github.com/wikimedia/analytics-refinery/blob/master/bin/sqoop-mediawiki-tables | Analytics Refinery ]] reposit... [06:29:19] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Deploy refinery to eventlogging hosts - https://phabricator.wikimedia.org/T198766 (10Marostegui) >>! In T198766#4402306, @elukey wrote: > @Marostegui, @jcrespo: we'd need to deploy the [[ https://github.com/wikimedia/analytics-refinery | A... [06:29:31] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Deploy refinery to eventlogging hosts - https://phabricator.wikimedia.org/T198766 (10Marostegui) >>! In T198766#4402306, @elukey wrote: > @Marostegui, @jcrespo: we'd need to deploy the [[ https://github.com/wikimedia/analytics-refinery | A... 
[06:42:46] * elukey brb [07:08:47] morninf elukey [07:12:57] o/ [07:18:22] (03PS15) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [07:26:12] (03CR) 10jerkins-bot: [V: 04-1] Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) (owner: 10Joal) [07:32:45] (03PS16) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [08:32:25] 10Analytics: Erros with the new SWAP notebooks - https://phabricator.wikimedia.org/T198909 (10JAllemandou) removing the "SQL-Context not loaded" part from the description, as spark 2.3 uses `SparkSession` named `spark` instead of a sql-context. the `sqlContext` is still usable through `spark.sqlContext`, but mos... [08:32:30] 10Analytics: Erros with the new SWAP notebooks - https://phabricator.wikimedia.org/T198909 (10JAllemandou) removing the "SQL-Context not loaded" part from the description, as spark 2.3 uses `SparkSession` named `spark` instead of a sql-context. the `sqlContext` is still usable through `spark.sqlContext`, but mos... [08:34:57] 10Analytics: Erros with the new SWAP notebooks - https://phabricator.wikimedia.org/T198909 (10JAllemandou) [08:35:06] 10Analytics: Erros with the new SWAP notebooks - https://phabricator.wikimedia.org/T198909 (10JAllemandou) [09:07:34] 10Analytics: Erros with the new SWAP notebooks - https://phabricator.wikimedia.org/T198909 (10Peachey88) [09:07:38] 10Analytics: Erros with the new SWAP notebooks - https://phabricator.wikimedia.org/T198909 (10Peachey88) [09:44:11] 10Analytics, 10EventBus, 10MassMessage, 10MediaWiki-JobQueue, and 2 others: Global mass message delivered on meta but not on other wikis? 
- https://phabricator.wikimedia.org/T195500 (10mobrovac) p:05Unbreak!>03High [09:44:16] 10Analytics, 10EventBus, 10MassMessage, 10MediaWiki-JobQueue, and 2 others: Global mass message delivered on meta but not on other wikis? - https://phabricator.wikimedia.org/T195500 (10mobrovac) p:05Unbreak!>03High [09:58:48] so today is varnishkafka love friday :) [09:59:19] I found an interesting failure scenario, that seems to happen only for varnishkafka eventlogging instances in singapore caching hosts [09:59:22] https://grafana.wikimedia.org/dashboard/db/varnishkafka?orgId=1&from=now-7d&to=now&var-instance=eventlogging&var-host=All [10:00:16] elukey: What happens? [10:00:42] I started looking to delivery report errors [10:00:54] they happen once in a while for cp5* hosts [10:01:00] (singapore) [10:01:09] and they match with increase in latency and conn timeouts [10:01:18] hm [10:01:18] on the vk side, there are a lot of conn timeouts [10:01:23] to a specific broker each time [10:01:48] elukey: no kafka in singapore, therefore long latencies for connection - right? [10:02:14] yeah we have only in eqiad.. https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png [10:02:53] it can take two paths, but basically it is either singapore sfo dallas or singapore sfo chicago [10:02:56] and then eqiad [10:03:31] so I think that given the higher latency of the link, the vk eventlogging instance sometimes has issues [10:03:34] and we drop events [10:05:19] (03PS7) 10Joal: [WIP] Add validation step in mediawiki-history jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/440005 (https://phabricator.wikimedia.org/T192481) [10:08:23] I am leaning towards socket.timeout.ms [10:08:38] hm [10:09:05] When you say "vk eventlogging instance sometimes has issues" - You mean too high latency, or something elsE? 
[10:09:32] no no I mean delivery report errors [10:09:36] == dropping event [10:09:56] Makes sense [10:10:31] Now it orders the right way in my mind: high latency --> vk-EL issues --> Drop [10:10:34] Sorry [10:12:58] no need, as always I am trying to get your opinion about something that I am not sure about :) [10:13:15] so the socket.timeout.ms is interesting [10:13:17] "Default timeout for network requests. Producer: ProduceRequests will use the lesser value of socket.timeout.ms and remaining message.timeout.ms for the first message in the batch." [10:13:32] message.timeout.ms in theory should be 5mins for vk [10:13:45] and we don't set socket.timeout.ms, so it should be the default of 1min [10:14:42] checking Wikimedia_network_overview.png there is definitely a high baseline of latency, compared to the other dcs [10:15:40] I checked one use case, and cp5012 had issues talking with kafka-jumbo1001, timeouts and high latencies.. in jumbo's logs I found a consumer group rebalance in progress, so I am wondering if other things happening on the broker might influence vk [10:16:05] event batches stacking up, the local delivery queue builds up [10:16:15] then the 1min timeout is easy to hit [10:17:14] joal take a look at this [10:17:16] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=25&fullscreen&edit&orgId=1&from=now-7d&to=now&var-instance=eventlogging&var-host=cp4* [10:17:20] note the cp4* (ulsfo) [10:17:32] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=25&fullscreen&edit&orgId=1&from=now-7d&to=now&var-instance=eventlogging&var-host=cp5* [10:17:38] cp5* (eqsin) [10:18:28] not sure why it happens only with eventlogging though [10:18:29] Wow [10:18:36] Huge spikes [10:19:00] higher average, but the spikes are scary [10:19:46] yes the baseline is ~250ms [10:35:47] so I think that what happens is that socket.timeout.ms kicks in after 60s of inactivity on the socket, and then the local queue timeouts following are explained as well [10:36:12] 
so it might be worth to raise socket.timeout.ms to say 120s or more in cp5* hosts [10:36:20] and see if anything changes [10:36:52] makes sense elukey - Does that mean that we see inactivity on EL for 1min in singapore? [10:41:08] joal: I think that it is possible that sometimes a broker builds up a queue of things to do, and may slow down producers/consumers... [10:41:30] to the point that they hit the 1m timeout [10:41:39] and drop a lot of events [10:41:44] elukey: I think I have never heard of that for kafka, but very possible nonetheless ! [10:42:17] it may be a combination of various things, I am not blaming directly kafka [10:42:26] (03CR) 10Joal: "2 questions for confirmation:" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/443069 (https://phabricator.wikimedia.org/T193641) (owner: 10Jonas Kress (WMDE)) [10:42:27] and let's remember that it happens on the link with higher latency [10:42:36] yes [10:42:42] so it could be a combination of link latency spikes + broker's latency itself [10:45:25] going afk for a couple of hours, will restart in the afternoon :) [10:49:58] (03CR) 10Joal: "Minor nits - Looks almoxst good. Let's make sure it is tested before merging adn deploy. 
Thanks :)" (036 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/443409 (https://phabricator.wikimedia.org/T193641) (owner: 10Jonas Kress (WMDE)) [11:57:15] 10Analytics: Errors with the new SWAP notebooks - https://phabricator.wikimedia.org/T198909 (10Reedy) [11:57:24] 10Analytics: Errors with the new SWAP notebooks - https://phabricator.wikimedia.org/T198909 (10Reedy) [12:49:40] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog, 10Reading-Infrastructure-Team-Backlog, 10Epic: Explore an API for logging events sampled by session - https://phabricator.wikimedia.org/T168380 (10Jhernandez) [12:49:46] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog, 10Reading-Infrastructure-Team-Backlog, 10Epic: Explore an API for logging events sampled by session - https://phabricator.wikimedia.org/T168380 (10Jhernandez) [13:35:08] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Varnishkafka eventlogging instances delivery failures - https://phabricator.wikimedia.org/T198070 (10elukey) Today I reviewed the varnishkafka grafana dashboard and I saw an ongoing pattern of sporadic delivery error reports for the varnis... [13:35:10] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Varnishkafka eventlogging instances delivery failures - https://phabricator.wikimedia.org/T198070 (10elukey) Today I reviewed the varnishkafka grafana dashboard and I saw an ongoing pattern of sporadic delivery error reports for the varnis... [13:38:03] do we have two bots --^? [13:41:12] elukey: maybe? [13:41:20] I can see only wm-bot2 [13:41:25] but I have no idea how it works [13:41:36] Maybe the bot is tired of us not reading the thing, so it posts it twice? 
[13:44:36] hey teaaam [13:44:42] joal: I found an interesting thing for vk [13:44:58] the traffic from Singapore is way more than the SFO ones [13:45:06] but less than Amsterdam [13:45:30] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Display of radio buttons in Wikistats 2 is somewhat confusing - https://phabricator.wikimedia.org/T183185 (10sahil505) p:05Triage>03Normal [13:45:31] It may also be incorrect batching on our side now that I think about it [13:45:33] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Display of radio buttons in Wikistats 2 is somewhat confusing - https://phabricator.wikimedia.org/T183185 (10sahil505) p:05Triage>03Normal [13:46:15] (03PS17) 10Joal: Add MediawikiHistoryChecker spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/439869 (https://phabricator.wikimedia.org/T192481) [14:11:20] 10Analytics, 10Analytics-Wikistats: Make the colors used the line charts in Wikistats 2 more easy to recognize. - https://phabricator.wikimedia.org/T183184 (10mforns) - keep 3 colors designed for sections: reading, contributing, content (we can use different shades if it looks better) - remove black border fro... [14:11:25] 10Analytics, 10Analytics-Wikistats: Make the colors used the line charts in Wikistats 2 more easy to recognize. - https://phabricator.wikimedia.org/T183184 (10mforns) - keep 3 colors designed for sections: reading, contributing, content (we can use different shades if it looks better) - remove black border fro... 
[14:14:29] 10Analytics, 10Analytics-Wikistats: Improve Wikistats2 map zoom - https://phabricator.wikimedia.org/T198867 (10mforns) if the first problem "diagonal zoom" takes a lot of time to solve, don't bother [14:14:32] 10Analytics, 10Analytics-Wikistats: Improve Wikistats2 map zoom - https://phabricator.wikimedia.org/T198867 (10mforns) if the first problem "diagonal zoom" takes a lot of time to solve, don't bother [14:21:25] 10Analytics: Fix sqoop script so that the jar-generation step doesn't print logs (alerts email sent by cron) - https://phabricator.wikimedia.org/T198966 (10Aklapper) [14:21:26] 10Analytics: Fix sqoop script so that the jar-generation step doesn't print logs (alerts email sent by cron) - https://phabricator.wikimedia.org/T198966 (10Aklapper) [14:24:00] 10Analytics, 10Analytics-Wikistats: Improvements to Wikistats2 chart popups - https://phabricator.wikimedia.org/T192416 (10mforns) - have the popup be the same component for all charts, that receives the data. libraries that can help: - C3 - http://nvd3.org/ - look at highcharts for inspiration [14:24:01] 10Analytics, 10Analytics-Wikistats: Improvements to Wikistats2 chart popups - https://phabricator.wikimedia.org/T192416 (10mforns) - have the popup be the same component for all charts, that receives the data. 
libraries that can help: - C3 - http://nvd3.org/ - look at highcharts for inspiration [14:27:55] joal: [14:28:00] I found something interesting [14:28:09] after "only" a few hours of pain [14:28:16] https://gerrit.wikimedia.org/r/444232 [14:32:34] now as far as I can read, enabling this means that the messages will be compressed in the topics with snappy [14:32:41] and the consumers will decompress them [14:32:54] need to wait for Andrew's opinion before pulling the trigger :) [14:41:21] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Varnishkafka eventlogging instances delivery failures - https://phabricator.wikimedia.org/T198070 (10elukey) So the first attempt is to introduce the Snappy compression for vk eventlogging, that in theory should reduce a lot the message si... [14:41:23] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Varnishkafka eventlogging instances delivery failures - https://phabricator.wikimedia.org/T198070 (10elukey) So the first attempt is to introduce the Snappy compression for vk eventlogging, that in theory should reduce a lot the message si... [15:44:03] interesting elukey! I knew we were compressing for webrequest (i actually caused a problem because of a bug a few years back), I thought we had it enabled by default everywhere! [15:45:06] joal: maybe it wasn't necessary when el was not handling that much of a traffic, but probably we are hitting some limits now [15:45:21] elukey: in any case, compression for data travelling makes sense! [15:46:04] I wouldn't take a plane with all my clothes over hangers :) [15:46:34] ahhahahah [15:47:00] joal: the thing that I am not sure about is what changes will it bring downstream [15:47:12] to those poor consumers [15:47:45] elukey: some more CPU usage :) [15:47:51] And less bandwidth ;) [15:48:58] hopefully also no bugs leading to consumers horribly dying for messages compressed not read etc.. 
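A rough illustration of the compression point discussed above: batches of similar JSON events shrink a lot before crossing a high-latency link, and consumers pay some CPU to restore them. This is only a sketch under stated assumptions. zlib from the Python standard library stands in for Snappy (which is not in the standard library and gives different ratios), and the event fields and the helper names `make_batch`, `compress_batch` and `decompress_batch` are hypothetical, not taken from varnishkafka or EventLogging.

```python
import json
import zlib


def make_batch(n):
    """Build a newline-delimited batch of n made-up JSON events (hypothetical fields)."""
    events = [
        {
            "schema": "VirtualPageView",
            "revision": 1,
            "wiki": "enwiki",
            "event": {"page_id": i, "source_title": f"Article_{i}"},
        }
        for i in range(n)
    ]
    return "\n".join(json.dumps(e) for e in events).encode("utf-8")


def compress_batch(raw):
    """Producer side: compress the whole batch (zlib here as a stand-in for Snappy)."""
    return zlib.compress(raw)


def decompress_batch(blob):
    """Consumer side: restore the original bytes, at the cost of some CPU."""
    return zlib.decompress(blob)


batch = make_batch(500)
compressed = compress_batch(batch)
restored = decompress_batch(compressed)

print(f"raw bytes:        {len(batch)}")
print(f"compressed bytes: {len(compressed)}")
print(f"round trip ok:    {restored == batch}")
```

The actual saving depends on the codec and on how repetitive the real events are, so the numbers printed here say nothing about what the vk eventlogging instances would see.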
[16:00:45] ping fdans milimetric [16:02:47] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Upgrade SWAP's JupyterLab from beta 1 to beta 2 - https://phabricator.wikimedia.org/T198738 (10Nuria) a:03Ottomata [16:06:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Productionize EventLogging sanitization - https://phabricator.wikimedia.org/T193176 (10mforns) [16:06:55] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Partially purge MobileWikiAppiOSUserHistory eventlogging schema - https://phabricator.wikimedia.org/T195269 (10mforns) [16:15:14] (03CR) 10Fdans: [C: 031] Make bar-chart and line-chart resilient to breakdowns with null values [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443640 (https://phabricator.wikimedia.org/T198630) (owner: 10Mforns) [16:38:20] 10Analytics, 10Analytics-EventLogging, 10Page-Previews, 10Readers-Web-Backlog, 10Readers-Web-Kanbanana-Board: Some VirtualPageView are too long and fail EventLogging processing - https://phabricator.wikimedia.org/T196904 (10ABorbaWMF) Thanks, @Ryasmeen [16:42:33] 10Analytics, 10Analytics-EventLogging, 10Page-Previews, 10Readers-Web-Backlog, 10Readers-Web-Kanbanana-Board: Some VirtualPageView are too long and fail EventLogging processing - https://phabricator.wikimedia.org/T196904 (10phuedx) >>! In T196904#4403947, @ABorbaWMF wrote: > Thanks, @Ryasmeen Seconded! [16:51:11] (03PS8) 10Joal: [WIP] Add validation step in mediawiki-history jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/440005 (https://phabricator.wikimedia.org/T192481) [16:53:06] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi) >>! In T198623#4397016, @elukey wrote: > I am pretty sure that this is a pre-scap thing, we should drop it :) Great! > Other thi... 
[16:59:44] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) @ayounsi are we sure that we can touch common-infrastructure4 without affecting anything else? Is there any trace of who made it... [17:03:53] * elukey off! [17:20:19] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi) >>! In T198623#4403989, @elukey wrote: > @ayounsi are we sure that we can touch common-infrastructure4 without affecting anythi... [17:58:51] 10Analytics: Sqoop more tables for mediawiki in monthly schedule - https://phabricator.wikimedia.org/T198983 (10Nuria) [17:59:04] 10Analytics: Sqoop more tables for mediawiki in monthly schedule - https://phabricator.wikimedia.org/T198983 (10Nuria) [18:05:05] 10Analytics, 10Analytics-Kanban: Problems with external referrals? - https://phabricator.wikimedia.org/T195880 (10JKatzWMF) Hi @Nuria just following up on this to see if you have had a chance to take a look. 
[18:08:59] 10Analytics: Alarms on Webrequest data - https://phabricator.wikimedia.org/T198985 (10Nuria) [18:14:21] 10Analytics: Data Quality Alarms - https://phabricator.wikimedia.org/T198986 (10Nuria) [18:14:34] 10Analytics: Data Quality Alarms - https://phabricator.wikimedia.org/T198986 (10Nuria) [18:14:36] 10Analytics: Alarms on Webrequest data - https://phabricator.wikimedia.org/T198985 (10Nuria) [18:15:01] 10Analytics, 10cloud-services-team (Kanban): Throughput alarms on refined data - https://phabricator.wikimedia.org/T198908 (10Nuria) [18:15:03] 10Analytics, 10Analytics-EventLogging: Alarm on errors on /var/log/upstart/eventlogging* files - https://phabricator.wikimedia.org/T170620 (10Nuria) [18:19:15] 10Analytics: Alarms in Eventlogging hadoop sanitization - https://phabricator.wikimedia.org/T198910 (10Nuria) [18:19:43] 10Analytics, 10cloud-services-team (Kanban): Alarms on throughput on refined data - https://phabricator.wikimedia.org/T198908 (10Nuria) [18:19:43] o/, checking up on the el hive stuff [18:20:49] looks pretty good on the camus / raw side [18:20:59] nuria_, ottomata I checked all camus'ed partitions and they look good! [18:21:04] great [18:21:06] ok [18:21:11] so i will launch a refine [18:21:17] after we refine everything, they I'll check for volume [18:21:28] cool [18:21:29] on meeting , can talk in a bit [18:21:58] ottomata, eventlogging_EventError is the only one that is missing [18:22:52] because it could not be backfilled from logs [18:24:43] oh right [18:24:44] that's ok tho [18:29:06] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change - https://phabricator.wikimedia.org/T198906 (10Ottomata) Running refine to backfill the last 552 hours. Camus imported hours with new data will be re-refined. As hdfs... [18:33:59] mforns: soooo [18:34:15] mforns: was the alarm we got about " was not able to refine .." 
blah [18:34:33] nuria_, I still do not understand that [18:34:54] mforns: ok, let's try to understand that too give me a sec [18:36:02] 10Analytics: Data Quality Alarms - https://phabricator.wikimedia.org/T198986 (10Nuria) We need to have alarms (coarse at first) to make sure that none of our pipelines is dropiing in throughput. These ideally should come from graphite metrics. [18:36:39] 10Analytics, 10cloud-services-team (Kanban): Alarms on throughput on refined data - https://phabricator.wikimedia.org/T198908 (10Nuria) [18:37:29] 10Analytics: Alarms on Webrequest data processing and pageview volume - https://phabricator.wikimedia.org/T198985 (10Nuria) [18:38:12] data partitions are there as expected in /wmf/data/raw/eventlogging [18:39:18] ottomata: "do you know where the alarm 'The following dataset targets in /wmf/data/raw/eventlogging between 2018-07-05T00:15:03.324Z and 2018-07-06T00:15:03.328Z have not yet been refined to /wmf/data/event:" comes from? [18:40:10] mforns: even for the ones mentioned like "MultimediaViewerDimensions` (year=2018,month=7,day=5,hour=21)" [18:40:16] nuria_, yes [18:40:32] mforns: let me see date of those partitions being created one sec [18:40:54] nuria_, there are several dates, like the data was added progressively [18:41:49] nuria_, in the case of eventlogging_MobileWikiAppSavedPages for example, all files are previous to alert email [18:47:40] 10Analytics: Alarms on Webrequest data processing and pageview volume - https://phabricator.wikimedia.org/T198985 (10Nuria) [18:49:29] there is a recurring pattern for the alarmed schemas where for 2018/07/05/21 there are several small files (6) whereas for 2018/07/05/22 and 2018/07/05/23 there's only one file [18:49:45] nuria yes [18:49:48] that comes from the org.wikimedia.analytics.refinery.job.refine.RefineMonitor job [18:49:59] also spark, uses some of the same code [18:50:40] it might be then that refine was busy refining the last 7 days and got some lag? 
[18:50:59] so that the refine monitor identified those hours that were missing? [18:52:12] possible yeah, since we imported the last 7 days of camus logs for all of those topics [18:52:31] plus, the alarmed schemas/hours seem to be refined now [18:52:39] ottomata1, mforns : files on /wmf/data/event/MultimediaViewerDimensions/year=2018/month=7/day=5/hour=21 [18:52:47] yea [18:52:48] are smaller than similar files for that schema [18:52:59] mforns: overall dir size is about same [18:53:04] oops, i accidentally ran the backfill refine with spark 1 and old refinery jar, due to non cleaned up puppet files that snuck past me [18:53:06] just cleaned them up [18:53:10] the old job should work fine too though [18:53:26] yes, nuria_ raw/eventlogging files are also like that [18:53:41] yargh oo [18:53:44] unless it doesn't deduplicate... [18:53:45] hmmmm [18:53:58] shoot [18:54:03] ottomata1, we can check that once it's refined [18:55:08] i think it will be best if i stop this one [18:55:18] 10Analytics: Alarms in Eventlogging hadoop sanitization - https://phabricator.wikimedia.org/T198910 (10Nuria) [18:55:27] and launch the proper one (sorry about that, there were wrapper scripts that should have been deleted from the old job that i thought were the proper ones to copy/paste) [18:56:09] ottomata1, do you want me to delete any data prior to you running that? [18:56:13] ottomata1: what is the ticket with issue again if you have it handy? [18:56:16] i think i need to stop this one and relaunch [18:56:29] yeah, mforns but it will now be hard to target which ones need rerefined.... [18:56:30] shoot [18:57:02] i think maybe we should just re-refine everything june 14 - june 28 [18:57:04] just run everything from 2018/06/14T00 to 2018 [18:57:06] yea [18:57:19] yeah ok, let's just delete the _REFINED flags everywhere in there and then do that [18:57:20] on it... [18:57:22] bc for deletion? 
[18:57:37] ok [18:57:43] ya one min [18:57:58] mforns: actually e-mail is from when data is being refined 4:30 am utc [19:03:26] 10Analytics: Alarms in Eventlogging hadoop sanitization - https://phabricator.wikimedia.org/T198910 (10Nuria) See an event that we should have gotten alarms for: https://phabricator.wikimedia.org/T198906 [19:21:10] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change - https://phabricator.wikimedia.org/T198906 (10Ottomata) Scratch that previous command, I got that out of an old now unpuppetized wrapper script for the old 'json' based... [19:42:12] 10Analytics, 10Wikimedia-Stream: EventStreams butcher up some Unicode characters - https://phabricator.wikimedia.org/T198994 (10stjn) [19:56:57] (03PS9) 10Joal: Add validation step in mediawiki-history jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/440005 (https://phabricator.wikimedia.org/T192481) [21:54:56] 10Analytics, 10Product-Analytics, 10Reading-analysis: Assess impact of ua-parser update on core metrics - https://phabricator.wikimedia.org/T193578 (10Tbayer) >>! In T193578#4300284, @fdans wrote: > @Tbayer ok, so getting IE traffic on these countries for the months previous to the update, we can see that th...