[04:25:52] PROBLEM - Check the last execution of monitor_sanitize_eventlogging_analytics on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_sanitize_eventlogging_analytics
[05:57:01] morning!
[05:58:25] lovely amount of alarms
[06:02:19] !log kill + re-run of pageviews hourly 30-03 hour 7 - seems stuck in heart beat after reduce completed
[06:02:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:13:20] * elukey thinks if trusting Oozie's alerts on the first of april is ok
[06:44:18] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&panelId=50&fullscreen&from=now-2d&to=now
[06:44:32] first of the month sqoop jobs kicking in :D
[06:46:50] as test, I tried to restart the monitor el sanitize
[06:49:32] RECOVERY - Check the last execution of monitor_sanitize_eventlogging_analytics on an-coord1001 is OK: OK: Status of the systemd unit monitor_sanitize_eventlogging_analytics
[06:58:03] mmmmm
[07:11:45] so after re-running it
[07:11:47] No dataset targets in /wmf/data/event between 2019-03-31T02:46:31.921Z and 2019-04-01T02:46:31.922Z need refinement to /wmf/data/event_sanitized
[07:12:19] but the alarm says: 2019-03-31T00:15:09.633Z and 2019-04-01T00:15:09.634Z
[07:12:46] ahhh ok so starting date is not ok
[07:15:13] restarted it with a different start time
[07:15:27] I didn't see any failure in the corresponding sanitize job
[07:15:37] so I am wondering if it was for the cluster being busy
[07:16:52] No dataset targets in /wmf/data/event between 2019-03-30T15:14:44.172Z and 2019-04-01T03:14:44.172Z need refinement to /wmf/data/event_sanitized
[07:17:28] aaaahhhh wrong date again
[07:24:55] (PS16) Elukey: Add artifacts for Debian Buster and upgrade to 0.31.0rc20-wikimedia1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/495182 (https://phabricator.wikimedia.org/T212243)
[07:31:12] this time 2019-03-29T09:17:55.471Z and 2019-04-01T03:17:55.472Z, all good
[08:36:12] (PS17) Elukey: Add artifacts for Debian Buster and upgrade to 0.31.0rc20-wikimedia1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/495182 (https://phabricator.wikimedia.org/T212243)
[08:39:00] hey joal, are you around?
[08:41:27] here you have a dump with all geotagged wikidataitems https://tools.wmflabs.org/wikidata-analysis/20181112/wdlabel.json
[08:44:07] and then get all the wikidata items in a given language from wikidatawiki.wb_items_per_site
[08:44:13] ups
[08:44:35] isaacj ^
[08:45:20] joal, previous message was not for you
[08:45:48] joal, can we move our meeting to Friday? I'm in a conference. Would work for you?
[09:06:53] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Staging environment for upgrades of superset - https://phabricator.wikimedia.org/T212243 (elukey) https://github.com/apache/incubator-superset/issues/7171 was generated by our setup.py script running webpack. I removed the step and ran p...
[10:29:08] joal, nevermind, I'll be on the meeting.
[10:38:37] Analytics, EventBus, Beta-Cluster-reproducible, Patch-For-Review: Ability to create blocks broken - https://phabricator.wikimedia.org/T219737 (Krenair) Open→Resolved works, thanks all
[11:43:30] Analytics, Dumps-Generation, WikiCite, Wikidata, Patch-For-Review: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday - https://phabricator.wikimedia.org/T216160 (ArielGlenn) Summarizing, we have three things at play: - consistence between entity dumps and...
[11:50:30] * elukey lunch!
[11:55:53] Analytics: Aggregate pageviews to Wikidata entities - https://phabricator.wikimedia.org/T215438 (Sascha) If nobody else has time to do this, may I volunteer to write the code? Please tell me where to start (which programming language, what framework, etc.)
[12:58:07] dsaez: maaaaan I'm very sorry I forgot to cancel our meeting this morning - I'm off today and tomorrow :S
[12:58:40] dsaez: Please excuse me ... I hope you enjoy the conference - Let's recombine on Friday if you want
[13:19:16] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Enable encryption and authentication for TLS-based Hadoop services - https://phabricator.wikimedia.org/T217412 (elukey) p:High→Normal
[13:27:34] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Enable encryption and authentication for TLS-based Hadoop services - https://phabricator.wikimedia.org/T217412 (elukey) Summary of the work done up to now: * Puppet groundwork to deploy trustore/keystore on Hadoop masters/workers and re...
[13:30:40] mwarf - Just realized my puppet patch for sqoop didn't change the number of processors - It's still 3 when it should be 10
[13:32:55] It'll work, but will be slower than expected - I'll also monitor sqoop for actor and comment tables, making sure they don't go so fast as to finish before the main tables one
[13:43:23] Analytics, User-Elukey: Upgrade analytics cluster to Cloudera CDH 5.16.1 - https://phabricator.wikimedia.org/T218343 (elukey)
[13:44:42] joal, np, i checked on your calendar
[13:55:52] hey everyone
[13:58:11] o/
[13:58:39] hey team :]
[14:02:11] mforns: I got this response: https://github.com/apache/incubator-druid/issues/6281#issuecomment-478144537
[14:02:21] so maybe "\"fieldName\"" ?
[14:03:36] mforns: if that doesn't work, maybe post your transformSpec there as a comment. I'm going to ask how to look at logs.
[14:10:02] hey milimetric! thanks
[14:10:04] reading
[14:10:16] mforns: I just posted a comment asking for how to look at logs
[14:10:22] also, milimetric, I found this: https://github.com/apache/incubator-druid/issues/7169 !
[14:10:29] I think that is the whole problem
[14:11:56] maybe if we use the flattenSpec (without flattening, just to indicate which columns the ingestion should parse)
[14:12:00] mforns: oh you're right, I'll post a follow-up, for sure
[14:12:26] mforns: could be worth a shot, but it sounds like the parquet optimization they're talking about happens before
[14:12:36] hm
[14:12:46] do you know if you can read from parquet without the timeAndDims parser?
[14:13:37] milimetric, I'm not sure, but what I understand is that everything that is not text, needs the timeAndDims thing
[14:13:47] gotcha
[14:14:19] let me know if you want to brainbounce a work-around or you want to try a few more variations
[14:14:46] milimetric, I will try the flatten thing then, if you want to join, super-welcome
[14:14:57] or else, I can tell you what I get
[14:47:11] (CR) Elukey: Oozie: add article recommender (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[14:53:34] (CR) Milimetric: [C: +2] Enable All Metrics in mobile top menu [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499976 (https://phabricator.wikimedia.org/T219581) (owner: Fdans)
[14:56:28] (Merged) jenkins-bot: Enable All Metrics in mobile top menu [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499976 (https://phabricator.wikimedia.org/T219581) (owner: Fdans)
[15:37:17] Analytics, Community-Tech, Event Metrics, Pageviews-API: metrics.wmflabs.org pageviews csv now redirecting to eventmetrics forbidden - https://phabricator.wikimedia.org/T219718 (Milimetric) Hi @mahmoud, we'd like to help you find a solution here. But we were surprised you're still using the wiki...
[15:38:03] Analytics, Community-Tech, Event Metrics, Pageviews-API: metrics.wmflabs.org pageviews csv now redirecting to eventmetrics forbidden - https://phabricator.wikimedia.org/T219718 (Milimetric) p:Triage→High
[15:38:18] Analytics, Analytics-Wikistats, Patch-For-Review: All Metrics link doesn't show up on mobile - https://phabricator.wikimedia.org/T219581 (Milimetric) p:Triage→High
[15:39:15] Analytics, EventBus, Operations, vm-requests, and 3 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Milimetric) p:Triage→High
[15:40:29] Analytics, EventBus, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), Services (watching): Schema Registry HTTP Service - https://phabricator.wikimedia.org/T219552 (Milimetric) p:Triage→High
[15:40:57] Analytics, Analytics-Kanban: Change permissions for daily traffic anomaly reports on stat1007 - https://phabricator.wikimedia.org/T219546 (Milimetric) p:Triage→High a:elukey
[15:42:07] Analytics, Discovery, Operations, Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (Milimetric) p:Triage→High
[15:43:08] Analytics, Operations, Wikimedia-Mailing-lists: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (Milimetric) p:Triage→Normal
[15:45:15] Analytics, Product-Analytics, Growth-Team (Current Sprint): Homepage: specify purging strategy - https://phabricator.wikimedia.org/T219252 (Milimetric) Please add us as reviewers once you have it worked out.
[15:45:56] Analytics, Analytics-Dashiki, good first bug: Detect bad hash in tabs layout - https://phabricator.wikimedia.org/T219235 (Milimetric)
[15:47:52] Analytics: Include Tulu Wikipedia in Quarry - https://phabricator.wikimedia.org/T148950 (Milimetric) Open→Invalid Data is available on labs for this wiki, you can see compiled stats from it on Wikistats: https://stats.wikimedia.org/v2/#/tcy.wikipedia.org
[15:48:50] Analytics, Analytics-Dashiki, good first bug: Detect bad hash in tabs layout - https://phabricator.wikimedia.org/T219235 (Milimetric) p:Triage→High
[15:50:24] Analytics, Analytics-Kanban, Operations, Patch-For-Review, User-Elukey: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (Milimetric) p:Normal→High
[15:51:10] Analytics: Add oozie start/restart snippet in coordinator.xml files - https://phabricator.wikimedia.org/T219451 (Milimetric) Open→Invalid They're already there coordinator.properties, we could update them and make sure they're easily copy/pasted directly.
[15:52:58] Analytics, EventBus, Services (watching): EventGate should extract event time from events and produce to kafka with timestamp - https://phabricator.wikimedia.org/T219513 (Milimetric) p:Triage→High
[15:53:03] Analytics, EventBus, Release Pipeline, serviceops, Services (watching): Modern Event Platform: Stream Intake Service: Documentation - https://phabricator.wikimedia.org/T219332 (Milimetric) p:Triage→High
[15:54:11] Analytics: Spike [2019-2020]. GPU enabled computations. How to do that best - https://phabricator.wikimedia.org/T217367 (Milimetric) p:Triage→Normal
[15:54:17] Analytics: Update grouped-wiki files for sqoop - https://phabricator.wikimedia.org/T219326 (Milimetric) p:Triage→High
[15:54:43] Analytics: Migrate jupyter notebooks to kubernetes - https://phabricator.wikimedia.org/T218621 (Milimetric) p:Triage→Normal
[15:54:50] Analytics: Migrate jupyter notebooks to kubernetes - https://phabricator.wikimedia.org/T218621 (Milimetric) p:Normal→Low
[15:55:05] Analytics: small bot activity marked as user in Manuel_de_Pedrolo page - https://phabricator.wikimedia.org/T213148 (Milimetric) p:Triage→Normal
[16:10:27] Analytics, Reading Depth, Readers-Web-Backlog (Tracking): [Bug] Many ReadingDepth validation errors logged - https://phabricator.wikimedia.org/T216063 (Milimetric) >>! In T216063#4953716, @phuedx wrote: > The "Extra data:" error is raised by `json.loads` as it encounters the `;` character at the end...
[16:24:48] Analytics: Wikistats: Change Mercator Projection to Eckert IV - https://phabricator.wikimedia.org/T218045 (Milimetric) p:Normal→High
[16:24:59] Analytics: Wikistats: Change Mercator Projection to Eckert IV - https://phabricator.wikimedia.org/T218045 (Milimetric) a:fdans→Milimetric
[16:27:25] Analytics: Percentage increase should be removed from"all" time range on wikistats UI - https://phabricator.wikimedia.org/T205809 (Milimetric) Open→Resolved this is done and deployed
[16:28:21] Analytics: Hide unavailable metrics from dashboard - https://phabricator.wikimedia.org/T204717 (Milimetric) Open→Resolved a:Milimetric done, only for mobile, but we thought it looked weird on desktop. We can hide once we have at least 3 metrics per area.
[16:30:31] Analytics, Analytics-Wikistats: Wikistats2: Values in map view show unnecessary decimal digits - https://phabricator.wikimedia.org/T200070 (Milimetric) p:Normal→High
[16:30:44] Analytics, Analytics-Wikistats: Audit Wikistats unit testing - https://phabricator.wikimedia.org/T192836 (Milimetric) p:Normal→High
[16:31:18] Analytics: Gather all constants related to mobile/responsiveness in config - https://phabricator.wikimedia.org/T190339 (Milimetric) p:Normal→High
[16:33:10] Analytics, Analytics-Wikistats: Annotations in wikistats that are only visible on "all" time range get bundled up (probably an issue we cannot resolve until we have a more granular time range) - https://phabricator.wikimedia.org/T200020 (Milimetric) ping @fdans can you test if this gets better with your...
[16:33:55] Analytics, Analytics-Wikistats: Organize annotations pages on meta by convention - https://phabricator.wikimedia.org/T194706 (Milimetric) p:Low→Normal
[16:34:47] Analytics, Analytics-Kanban, Analytics-Wikistats: Beta Release: Wikistats: support annotations in graphs - https://phabricator.wikimedia.org/T178015 (Milimetric)
[16:34:49] Analytics, Analytics-Wikistats: Make Dashiki Extension render annotations pages better - https://phabricator.wikimedia.org/T194708 (Milimetric) Open→Resolved a:Milimetric
[16:37:24] Analytics, Analytics-Wikistats: Reindex mediawiki_history_reduced with lookups - https://phabricator.wikimedia.org/T193650 (Milimetric) p:Low→Unbreak!
[16:37:26] Analytics, Analytics-Wikistats: Reindex mediawiki_history_reduced with lookups - https://phabricator.wikimedia.org/T193650 (Milimetric) p:Unbreak!→Triage
[16:38:03] Analytics, Analytics-Wikistats: Use line charts when breaking down a column chart in Wikistats2 - https://phabricator.wikimedia.org/T189200 (Milimetric) p:Low→High a:mforns
[16:38:36] Analytics, Analytics-Wikistats: Use line charts when breaking down a column chart in Wikistats2 - https://phabricator.wikimedia.org/T189200 (Milimetric) warning: if this is not fun, reassign to dan or defer to after-beta
[16:39:01] Analytics, Analytics-Wikistats: Allow namespace selection on Top Viewed Articles - https://phabricator.wikimedia.org/T182964 (Milimetric) p:Low→Normal
[16:39:47] Analytics, Analytics-Wikistats: Can't combine 'Editor type' and editor 'Activity level' filters to narrow results (in WikiStats 2.0) - https://phabricator.wikimedia.org/T183316 (Milimetric) p:Low→Normal
[16:40:09] Analytics, Analytics-Wikistats: Needs Design: combine multiple filters and/or splits - https://phabricator.wikimedia.org/T183316 (Milimetric)
[16:42:14] Analytics, Analytics-Kanban, Analytics-Wikistats: Pixel ratio messed up on Windows Chrome - https://phabricator.wikimedia.org/T194428 (Milimetric) Open→Resolved
[16:43:05] Analytics, Analytics-Kanban, Analytics-Wikistats: Beta Release: Wikistats: support annotations in graphs - https://phabricator.wikimedia.org/T178015 (Milimetric)
[16:43:07] Analytics, Analytics-Wikistats: Interactively add annotations from Wikistats UI - https://phabricator.wikimedia.org/T194710 (Milimetric) Open→Declined
[16:43:21] Analytics, Analytics-Wikistats: Wiki popup form to add annotations on meta - https://phabricator.wikimedia.org/T194711 (Milimetric) Open→Declined
[16:43:23] Analytics, Analytics-Kanban, Analytics-Wikistats: Beta Release: Wikistats: support annotations in graphs - https://phabricator.wikimedia.org/T178015 (Milimetric)
[16:43:34] Analytics, Analytics-Wikistats: Annotations in wikistats that are only visible on "all" time range get bundled up (probably an issue we cannot resolve until we have a more granular time range) - https://phabricator.wikimedia.org/T200020 (Milimetric) p:Low→High
[16:52:39] Analytics, Operations, ops-eqiad, User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (elukey) If you don't find anybody online to shutdown stat1005 please go ahead and do it, we are not running anything on it!
[16:54:41] GPU arrived!
[16:54:45] \o/
[17:06:16] Analytics, Beta-Cluster-Infrastructure, EventBus, Patch-For-Review, Wikimedia-production-error: PHP Warning: Array key should be either a string or an integer - https://phabricator.wikimedia.org/T219738 (MarcoAurelio) Is this a wmf.24 blocker as T219737 was?
[17:39:46] (CR) MarcoAurelio: [C: +1] "Wiki created." [analytics/refinery] - https://gerrit.wikimedia.org/r/499673 (https://phabricator.wikimedia.org/T212597) (owner: MarcoAurelio)
[17:40:06] gives me merge conflict ^
[17:40:26] there are two patches with +2 pending submission there
[17:42:16] (Abandoned) MarcoAurelio: whitelist: add hyw.wikipedia [analytics/refinery] - https://gerrit.wikimedia.org/r/499673 (https://phabricator.wikimedia.org/T212597) (owner: MarcoAurelio)
[17:42:41] nvm - dupe patch
[17:52:53] hey team, sorry internet hiccup
[17:55:49] looking at alarms
[17:58:24] a-team: very early drumming rolls buut...
[17:58:27] elukey@stat1005:~/square$ ./square.out
[17:58:27] info: running on device Vega 10 XT [Radeon PRO WX 9100]
[17:58:27] info: allocate host mem ( 7.63 MB)
[17:58:27] info: allocate device mem ( 7.63 MB)
[17:58:29] info: copy Host2Device
[17:58:31] info: launch 'vector_square' kernel
[17:58:34] info: copy Device2Host
[17:58:36] info: check result
[17:58:39] PASSED!
[18:00:56] oooh
[18:01:37] Analytics, Operations, ops-eqiad, User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (Cmjohnson)
[18:02:27] Analytics, Operations, hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (Cmjohnson)
[18:02:31] going to test tensorflow tomorrow
[18:02:38] but all things that were failing are passing now
[18:03:50] * elukey dances
[18:04:40] Analytics, Operations, ops-eqiad, User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (Cmjohnson) Open→Resolved card has been swapped
[18:05:12] * elukey off!
[18:11:40] * neilpquinn tries to turn elukey back on
[18:11:47] i see a WX9100 on lspci from stat1005 \o/
[18:14:46] * ebernhardson doesn't see pip or virtualenv though :P
[18:14:59] a-team, question about eventlogging: let's say we add a new optional field to the schema for future use and then deploy the new schema version without actually setting the new field. I expect that nothing would change in Hive except the schema's table getting the new column of nulls. Is that right?
[18:26:52] neilpquinn: if data does not at all have the new field nothing will happen (no column added)
[18:27:31] nuria: okay, makes sense—and I suppose the column will be added the first time an event containing the field is sent?
[18:27:33] neilpquinn: for a column to exists some data needs to exist (at this time this might change once we refine data looking at schemas rather than "inferring" schemas)
[18:27:38] neilpquinn: yes
[18:29:47] Analytics, Reading Depth, Readers-Web-Backlog (Tracking): [Bug] Many ReadingDepth validation errors logged - https://phabricator.wikimedia.org/T216063 (phuedx) >>! In T216063#5074754, @Milimetric wrote: > We are leaning towards filtering out events coming from non-wikimedia domains. I think that's a...
[18:30:26] nuria: is that still true after their most recent EL refine changes, they're reading the schema now
[18:31:05] milimetric: no, those changes are not final yet
[18:31:16] k
[18:31:27] milimetric: sorry, the changes to read schemas to refine are not yet final
[18:54:39] Analytics, DBA: Proposal: Make centralauth db replicate to all the analytics dbstores - https://phabricator.wikimedia.org/T219827 (Bawolff)
[18:57:17] (PS1) Neil P. Quinn-WMF: Add EditAttemptStep's bucket field to EventLogging whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/500524
[18:57:53] Analytics: Refine eventlogging pipeline should not refine data for domains that are not wikimedia's - https://phabricator.wikimedia.org/T219828 (Nuria)
[18:58:36] Analytics: Refine eventlogging pipeline should not refine data for domains that are not wikimedia's - https://phabricator.wikimedia.org/T219828 (Nuria) ping @phuedx and @Jdlrobson so they are aware this ticket exists
[18:58:49] Analytics, DBA: Proposal: Make centralauth db replicate to all the analytics dbstores - https://phabricator.wikimedia.org/T219827 (Bawolff) > (e.g. Checking what percentage of enwiki admins have 2FA enabled). Another use case I recently encountered, is I wanted to see skin statistics for active users on...
[19:02:25] (CR) Nuria: Add EditAttemptStep's bucket field to EventLogging whitelist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/500524 (owner: Neil P. Quinn-WMF)
[19:08:05] Analytics, DBA: Proposal: Make centralauth db replicate to all the analytics dbstores - https://phabricator.wikimedia.org/T219827 (Marostegui) That's not possible cause it would require setting up multi-source replication on all the sections that are not s7 (where centralauth database lives). We are not...
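
A note on the EventLogging exchange above (18:14 to 18:31): nuria's point is that Hive columns are currently inferred from the data, so an optional schema field yields no column until some event actually carries it. The following is a minimal Python sketch of that inference behaviour only. It is not the Refine code, and `infer_columns`, `to_rows`, and the sample events are hypothetical names made up for illustration.

```python
def infer_columns(events):
    """Return field names in order of first appearance across all events.

    Columns come from the data itself, not from a declared schema
    (assumption: this mirrors the "inferring" behaviour described in the chat).
    """
    columns = []
    for event in events:
        for field in event:
            if field not in columns:
                columns.append(field)
    return columns


def to_rows(events, columns):
    """Project events onto the inferred columns; missing fields become None."""
    return [[event.get(col) for col in columns] for event in events]


# Hypothetical events: the new schema version allows an optional "bucket"
# field, but no event sets it yet, so no "bucket" column is inferred.
old_events = [{"action": "init"}, {"action": "ready"}]
print(infer_columns(old_events))

# The first event that carries the field makes the column appear;
# earlier rows get None (NULL) for it.
new_events = old_events + [{"action": "init", "bucket": "test"}]
new_columns = infer_columns(new_events)
print(new_columns)
print(to_rows(new_events, new_columns))
```

Once refinement reads the declared schema instead of inferring it (the changes milimetric asks about, which nuria says are not final), the column list would presumably come from the schema rather than from a pass like `infer_columns`.
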
[19:24:34] Analytics, Dumps-Generation, WikiCite, Wikidata, Patch-For-Review: Update wikidata-entities dump generation to fixed day-of-month instead of fixed weekday - https://phabricator.wikimedia.org/T216160 (Nicolastorzec) Thanks for the summary Ariel. FWIW: - Changing the Wikidata dump generation...
[19:38:52] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on icinga1001 is CRITICAL: 1843 gt 1000 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[19:43:49] (CR) Bmansurov: "Moritz, could you chime in on the conversation above? I'd appreciate your help. Thanks!" [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[20:01:29] (CR) Mforns: Add EditAttemptStep's bucket field to EventLogging whitelist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/500524 (owner: Neil P. Quinn-WMF)
[20:01:48] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on icinga1001 is OK: (C)1000 gt (W)100 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[20:35:33] (CR) Neil P. Quinn-WMF: Add EditAttemptStep's bucket field to EventLogging whitelist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/500524 (owner: Neil P. Quinn-WMF)
[20:39:11] Analytics, Beta-Cluster-Infrastructure, EventBus, Patch-For-Review, Wikimedia-production-error: PHP Warning: Array key should be either a string or an integer - https://phabricator.wikimedia.org/T219738 (MarcoAurelio)
[20:43:18] Analytics, DBA: Proposal: Make centralauth db replicate to all the analytics dbstores - https://phabricator.wikimedia.org/T219827 (Milimetric) @Bawolff I ran into the same issue, and sadly as Manuel pointed out we can't do that going forward. Our team's proposal is to replicate mysql to Hadoop via a sol...
[20:47:15] (CR) Nuria: [C: +2] Add EditAttemptStep's bucket field to EventLogging whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/500524 (owner: Neil P. Quinn-WMF)
[20:52:10] Nettrom: yt?
[20:52:20] nuria: yes!
[20:53:17] Nettrom: on the homepage schema there is a homepage_pageview_token "One-time token per page load.", FYI that there is already a pageview token
[20:53:49] nuria: yeah, but that's shared between schemas and stored
[20:54:03] Nettrom: that gets calculated on every pageview so if that is what you are looking for (seems like it) you do not need an additional one to be calculated
[20:54:45] hm, in this case we might be able to reuse it, though… let me think about this
[20:54:54] Nettrom: https://github.com/wikimedia/mediawiki/blob/master/resources/src/mediawiki.user.js#L84
[21:09:17] nuria: thanks for bringing this up! I think in this case we might just reuse the pageview token, I'll discuss it with the team's engineers
[21:09:55] Nettrom: there was another schema that i now cannot find that i think also had this field right?
[21:11:09] nuria: all the schemas that we are planning to use for the Homepage will reuse this token, currently that's three: HomepageVisit, HomepageModule, and HelpPanel
[21:13:16] Nettrom: i see, it makes sense to reuse in all three probably, that way instrumentation code can ask for it , does not need to cache it in memory for the other events etc
[21:14:47] Analytics, Product-Analytics, MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review: Standardize datetimes/timestamps in the Data Lake - https://phabricator.wikimedia.org/T212529 (Ottomata) > it sounds like the long-term vision you have is using ISO strings for EventLogging and similar data...
[21:15:16] cc-ing RoanKattouw so he knows about pageview token , please see conversation with Nettrom right above
[21:18:13] Analytics: Review parent task for any potential pageview definition improvements - https://phabricator.wikimedia.org/T156656 (Milimetric) oh! thanks! I was confused, now I'm not :)
[21:23:12] Analytics, Analytics-EventLogging, Analytics-Kanban, Fundraising-Backlog, and 2 others: Fix EventLogging schemas that use array for items type - https://phabricator.wikimedia.org/T218617 (Ottomata) > We will never be able to support this freedom in schemas that are persisted to hive, correct? Co...
[21:33:08] Analytics, Analytics-Kanban, EventBus, MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), Patch-For-Review: Make Refine use JSONSchemas of event data to support Map types and proper types for integers vs decimals - https://phabricator.wikimedia.org/T215442 (nettrom_WMF)
[21:45:37] Analytics-Kanban, Operations, SRE-Access-Requests, Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (Tbayer) @elukey Sure, that totally makes sense! The end of January estimate from T178802#4647106 turned out a bit optimistic (see again our internal tim...
[22:17:23] Are kafka-jumbo* logs going anywhere? I'm getting some SocketTimeoutException from spark trying to read, and wondering if kafka is emitting any logs on the other side
[22:17:39] using `host:kafka-jumbo*` in logstash shows nothing though
[22:18:52] i've been avoiding upgrading to the new consumer libs, since there is no python support, but maybe we need to ...
[22:21:48] PROBLEM - Varnishkafka Eventlogging Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All
[22:22:12] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on icinga1001 is CRITICAL: 2.689e+05 gt 1000 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[22:24:29] hmm, oddly suspicious timing :S my spark job died at 22:17:49 fwiw
[22:25:08] it would have started reading kafka at 22:11:23
[22:29:26] PROBLEM - Throughput of EventLogging NavigationTiming events on icinga1001 is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1
[22:33:25] ebernhardson: no kafka logs that i know of on logstash but there might be some for which you need ops superpowers
[22:36:02] ebernhardson: argh there are tons of kafka errors
[22:37:07] * ebernhardson hopes he didn't cause those somehow, the timing seems plausibly related
[22:38:35] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1004 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[22:57:37] the 'latency to kafka brokers' tab of varnishkafka looks to line up pretty well with my thing pulling data from kafka, atm i'm rewriting to use python kafka consumer instead of java so i can control things like throttling in the future.
[22:58:16] (this was using java one, i decided since i can't seem to get good info out of the java one and the 0.10+ KafkaRDD isn't available in python to just write out our use case explicitly)
[22:59:17] the volume wasn't particularly high though, bytes out by topic only hit 20 MB/s across 35 partitions
[23:00:12] ebernhardson: but wait this is consuming
[23:00:42] ebernhardson: it will cause it not to be able to ingest?
[23:01:14] nuria: this was consuming, the data was produced earlier at a slower rate
[23:02:05] production was ~70msg/s and < 1MB/s over 40 minutes
[23:02:49] mjolnir does a weird thing with kafka where it waits to find out where the end of the data it wants is (via a backchannel) so it can create spark KafkaRDD that contains the full range
[23:03:27] ebernhardson: what is the topic?
[23:03:33] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=mjolnir.msearch-prod-response
[23:04:57] switching it over to the standard python kafka consumer will let us have further control over throttling, that was probably however fast spark<->kafka decided to talk
[23:05:33] ebernhardson: everything looks kaput
[23:05:36] ebernhardson: https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad&refresh=5m&orgId=1&from=now-2h&to=now
[23:06:51] nuria: indeed it looks like maybe the SocketTimeoutExceptions that my spark app received also went to other consumers, and they didn't restart?
[23:07:00] ebernhardson: looks like we do not have any EL events
[23:07:56] nuria: actually, owned partitions per host is also way off, did some kafka servers fall over? Without any logging in logstash I can't do any better than guess in the wind
[23:08:18] nuria: you might want to ping andrew?
[23:09:01] ebernhardson: I just did
[23:09:58] ebernhardson: who was the ops person that did the logstash kafka work?
[23:10:55] nuria: godog/fillipo i think? he's in EU
[23:12:53] ebernhardson: ya, i am trying to see who could help us here and can only think of brandon
[23:13:39] i poked mediawiki_security, but only turning up krinkle who doesn't have sre access either
[23:15:16] ebernhardson: ok, all eventlogging events are not coming
[23:22:47] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[23:22:53] Analytics: jumbo kafka cluster is down - https://phabricator.wikimedia.org/T219842 (Nuria)
[23:23:17] ebernhardson: can you describe what you think happened with your consumer here: https://phabricator.wikimedia.org/T219842 (briefly)
[23:23:31] ebernhardson: i have called andrew but he is not available
[23:24:13] nuria: someone from logging on sre is looking at logs now (via #mediawiki_security chan)
[23:24:19] cwhite
[23:24:23] ebernhardson: ah let me join
[23:32:05] Analytics: jumbo kafka cluster is down - https://phabricator.wikimedia.org/T219842 (Nuria) Alarms at 3:21 PST: icinga-wm> PROBLEM - Varnishkafka Eventlogging Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboa...
[23:39:18] Analytics, Analytics-Wikistats, Research: Renovation of Wikistats production jobs - https://phabricator.wikimedia.org/T176478 (leila) @Erik_Zachte I'm going to mark this one as resolved. If you don't agree, please re-open.
[23:39:22] Analytics, Analytics-Wikistats, Research: Renovation of Wikistats production jobs - https://phabricator.wikimedia.org/T176478 (leila) Open→Resolved
[23:41:56] Analytics: jumbo kafka cluster is down - https://phabricator.wikimedia.org/T219842 (EBernhardson) Possibly related, mjolnir in the analytics cluster started reading via spark KafkaRDD `mjolnir.msearch-prod-response` around 22:11:23 and died around 22:17:49 with a variety of `SocketTimeoutException` coming fr...
[23:43:35] RECOVERY - EventLogging overall insertion rate from MySQL consumer on graphite1004 is OK: OK: Less than 20.00% under the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[23:44:04] Analytics, Research: Migrate pagecounts-ez generation to hadoop - https://phabricator.wikimedia.org/T192474 (leila) @Nuria what shall we do with this task? Anything Research can help with?
[23:53:31] nuria: i gotta run to the preschool, it looks like competent hands are dealing with this and I'm not doing much (besides maybe breaking it in the first place...)
[23:58:41] Analytics, Research: Migrate pagecounts-ez generation to hadoop - https://phabricator.wikimedia.org/T192474 (Nuria) We have no plans to tackle this for the next couple of quarters.
[23:58:51] Analytics: jumbo kafka cluster is down - https://phabricator.wikimedia.org/T219842 (EBernhardson) This graph suggests the problem started perhaps 22:11, timing very close to mjolnir/spark starting to read: https://grafana.wikimedia.org/d/000000234/kafka-by-topic?refresh=5m&orgId=1&var-datasource=eqiad%20prom...
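
A note on the throttling idea ebernhardson mentions at 22:57 and 23:04: the log does not say how the rewritten Python consumer would limit its read rate, so the following is only a sketch of one common approach, pacing an iterator of messages to an average byte rate. The `throttled` helper, its parameters, and the sample messages are hypothetical. In real use the iterable would presumably be built from a Kafka consumer's message payloads, which is an assumption and not something shown in the chat.

```python
import time


def throttled(messages, max_bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
    """Yield messages while keeping the average byte rate at or below the limit.

    `messages` is any iterable of bytes-like payloads. `clock` and `sleep`
    are injectable so the pacing logic can be exercised without real waiting.
    """
    start = clock()
    consumed = 0
    for message in messages:
        yield message
        consumed += len(message)
        # Earliest moment at which `consumed` bytes fit under the rate limit.
        earliest = start + consumed / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)


# Usage with hypothetical payloads and a limit high enough to finish instantly.
for payload in throttled([b"first", b"second"], max_bytes_per_sec=1_000_000):
    print(payload)
```

Pacing on the consumer side like this caps how hard a backfill read hits the brokers, at the cost of a longer job. Whether the read rate was actually what hurt the jumbo cluster is not established in this log.
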