[05:20:53] PROBLEM - Check the last execution of refinery-import-page-current-dumps on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:07:33] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Create HDFS /tmp/ cleaner - https://phabricator.wikimedia.org/T235200 (10elukey) On an-coord1001 I can see: ` -- Logs begin at Tue 2019-11-05 13:34:31 UTC, end at Wed 2019-11-06 07:06:56 UTC. -- Nov 05 23:00:01 an-coord1001 systemd[... [08:59:11] PROBLEM - Check the last execution of analytics-database-meta-snapshot-copy-to-hdfs on an-master1002 is CRITICAL: NRPE: Command check_check_analytics-database-meta-snapshot-copy-to-hdfs_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:59:13] 10Analytics, 10Analytics-Kanban: Prepare the Hadoop Analytics cluster for Kerberos - https://phabricator.wikimedia.org/T237269 (10elukey) [09:01:07] this is me, no issue --^ [09:12:50] joal: bonjour (I know you are out, just leaving a note) - I tested pyarrow's hdfs support, that uses libhdfs's JNI bindings, and it works fine with encrypted RPC [09:13:45] snakebite is a pure python solution I know, but it needs constant care to re-invent what is already in libhdfs.. [10:12:06] 10Analytics, 10Discovery, 10Event-Platform, 10Wikidata, and 3 others: Log Wikidata Query Service queries to the event gate infrastructure - https://phabricator.wikimedia.org/T101013 (10dcausse) Tested sending an event to eventgate in beta generated by new code: ` curl -XPOST -H"Content-Type: application/js... [10:22:23] bbiab [10:36:49] (03PS1) 10Awight: Fix nonexistent field in query [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/549054 (https://phabricator.wikimedia.org/T214493) [10:42:54] Is there something wrong with eventlogging->hadoop ingestion, since November 1st?
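A note on the pyarrow test elukey mentions above: pyarrow's HDFS client wraps libhdfs over JNI, so it inherits the cluster's Hadoop configuration (including encrypted RPC) instead of reimplementing the protocol the way a pure-Python client like snakebite must. A hedged sketch of such a smoke test, assuming a working Hadoop client configuration on the host; the listed path is illustrative:

```shell
# Hedged sketch of a pyarrow-over-libhdfs smoke test like the one described above.
# Assumes hadoop client tooling and config are present; the /tmp path is illustrative.
export CLASSPATH="$(/usr/bin/hadoop classpath --glob)"   # libhdfs needs the Hadoop jars
python3 - <<'EOF'
import pyarrow as pa
fs = pa.hdfs.connect()   # JNI route through libhdfs; honors the cluster's RPC settings
print(fs.ls('/tmp'))     # simple smoke test against the cluster
EOF
```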
[10:46:19] 10Analytics, 10WMDE-Analytics-Engineering, 10Wikidata: Track WDQS updater UA in wikidata-special-entitydata grafana dashboard - https://phabricator.wikimedia.org/T218998 (10Rosalie_WMDE) [10:47:03] 10Analytics, 10WMDE-Analytics-Engineering, 10Wikidata, 10Wikidata-Campsite: Track WDQS updater UA in wikidata-special-entitydata grafana dashboard - https://phabricator.wikimedia.org/T218998 (10Rosalie_WMDE) [10:48:00] elukey: what's the purpose of eventgate-analytics vs eventgate-main ? [10:48:22] and hi :) [10:49:04] awight: in theory no, what are you seeing? [10:50:14] dcausse: hi! My understanding is that we want to separate analytics-related events from something like job-queue events from mediawiki [10:50:30] different availability requirements, workloads, etc.. [10:51:22] not sure if it answers the question [10:51:22] eventgate-analytics does not have a discovery service like eventgate-main.discovery.wmnet? [10:52:09] nope, it produces to kafka-jumbo that is only in eqiad [10:52:10] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#EventGate_Services [10:52:54] let's say I have an app in codfw that wants to produce "analytic" events should I use eventgate-analytics.eqiad.wmnet ? [10:54:14] dcausse: is the analytics event already registered in the schema repository etc...? 
If so I'd say yes, but better ask to Andrew [10:54:32] in any case, the alternative would be to use eventgate-main to have eqiad/codfw [10:54:45] elukey: yes the schema is ready, just not sure which endpoint to use yet [10:55:19] I think cirrus emits analytic events to -main and it got somehow replicated to hadoop, but it's unclear to me how [10:55:50] yes this is due to Kafka mirror maker [10:56:05] kafka main eqiad and codfw mirror each other topics [10:56:20] and kafka jumbo pulls topics from main-eqiad [10:56:25] ah [10:56:31] that must be the reason [10:56:36] and then camus pulls from jumbo and eventually it gets to hdfs [10:57:35] I probably want to use -main then, but will double check with Andrew, thanks! [10:57:46] exactly, let me know! I am interested :) [10:59:41] elukey: I've been working with two event streams which all seem to grind to a halt around the 1st. -- but meanwhile, I found a third stream which looks just fine. I'll start with the client-side of the pipeline, since the numbers are 1000x lower but not zero, so I think the few remaining events are from stale browser sessions. [11:01:15] awight: ack, let me know if I can help! [11:04:16] Thanks! Yeah this corresponds to a train, must be my own mistake. [11:29:05] 10Analytics, 10WMDE-Analytics-Engineering, 10WMDE-FUN-Funban-2019, 10WMDE-FUN-Sprint-2019-10-14, 10WMDE-New-Editors-Banner-Campaigns (Banner Campaign Autumn 2019): Implement banner design for WMDEs autum new editor recruitment campaign - https://phabricator.wikimedia.org/T235845 (10GoranSMilovanovic) @aw... [11:30:38] 10Analytics-Kanban, 10Analytics-Wikistats: Strip www from project in URL if it's included - https://phabricator.wikimedia.org/T237520 (10fdans) [11:43:05] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Rerun sanitization before archiving eventlogging mysql data - https://phabricator.wikimedia.org/T236818 (10elukey) Sanitization is still running on the two databases! 
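The replication path elukey sketches above (produce to kafka main-eqiad, main eqiad/codfw mirror each other, kafka-jumbo pulls from main-eqiad, then Camus lands it in HDFS) can be spot-checked per topic. A hedged sketch with kafkacat; the broker hostname and the topic name are placeholders, not taken from the log:

```shell
# Hedged sketch: check that a topic produced to kafka main has been mirrored
# to kafka-jumbo. Broker host and topic name below are placeholders.
TOPIC='eqiad.my-analytic-events'   # hypothetical topic name
kafkacat -b kafka-jumbo1001.eqiad.wmnet:9092 -L | grep "$TOPIC"        # topic present on jumbo?
kafkacat -b kafka-jumbo1001.eqiad.wmnet:9092 -t "$TOPIC" -C -o -5 -e   # tail last 5 messages per partition
```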
[11:44:57] 10Analytics-Kanban, 10Analytics-Wikistats: Strip www from project in URL if it's included - https://phabricator.wikimedia.org/T237520 (10fdans) p:05Triage→03High [12:01:11] (03PS2) 10Fdans: Add mediarequests per referer metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/542999 [12:01:57] * elukey lunch! [13:32:13] (03CR) 10Joal: [V: 03+2] "Merging for today's deploy train" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/548736 (https://phabricator.wikimedia.org/T233661) (owner: 10Joal) [14:16:43] just kinited from stat1005 [14:16:45] \o/ [14:16:52] hehe :) [14:16:55] awesome elukey [14:17:11] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Prepare the Hadoop Analytics cluster for Kerberos - https://phabricator.wikimedia.org/T237269 (10elukey) [14:17:21] elukey: I'm looking at logs for import_pages_current_dumps and couldn't find anything :( [14:17:45] joal: journalctl [14:18:06] Ahhh [14:18:19] sudo journalctl -u refinery-import-page-current-dumps.service [14:18:40] it seems to be a JSON decode error but I had no idea how to fix it ;( [14:19:50] elukey: could it be possible that we tried to read the file while the job scheduler tried to write it? [14:20:14] elukey: I looked at the file and it looks good [14:20:34] elukey: maybe we can try a manual run? Or reset the timer so that it continues its job tomorrow? [14:21:17] elukey: it would be: sudo systemctl reset-failed refinery-import-page-current-dumps.service [14:21:20] right? [14:22:27] Also elukey - Would now be a good time for you to talk about oozie-api-script? [14:23:19] joal: fine if you want to restart, reset-failed is also correct :) [14:23:36] yep for oozie, I am reviewing kerberos stuff [14:23:54] Let's reset-failed and see if the error occurs tomorrow morning again [14:23:54] I can take a break for mental sanity [14:23:57] :) [14:24:10] is it ok to restart it?
So we'll know sooner [14:24:19] I'm not sure talking about oozie is a mental-sanity break :) [14:24:26] elukey: sure! [14:24:40] 10Analytics, 10Discovery, 10Event-Platform, 10Wikidata, and 3 others: Log Wikidata Query Service queries to the event gate infrastructure - https://phabricator.wikimedia.org/T101013 (10Ottomata) > Sorry about this stupid queue name No worries, its beta ┐|・ิω・ิ#|┌ [14:25:58] I am in the cave whenever you are good [14:26:05] Ah sorry [14:33:31] hmm elukey can I not use e.g. $(/usr/bin/hadoop classpath) in a systemd unit ExecStart?! [14:33:32] ha [14:33:36] right not a shell script... [14:33:37] ghrrr [14:33:42] ok gotta make a wrapper [14:36:03] ottomata: o/ in a meeting with Joseph tell me if yoi need help :) [14:36:15] (03PS1) 10Ottomata: Add bin/hdfs-cleaner wrapper script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/549092 (https://phabricator.wikimedia.org/T235200) [14:36:53] milimetric: you doing train today ya? [14:39:33] yes ottomata, after standup [14:39:56] well, I’m missing standup actually so after techcom [14:40:24] ok cool [15:01:00] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Version analytics meta mysql database backup - https://phabricator.wikimedia.org/T231208 (10elukey) ` Terminated Jobs: JobId Level Files Bytes Status Finished Name ======================================================... [15:14:55] 10Analytics, 10New-Readers: Add KaiOS to the list of OS query options for pageviews in Turnilo - https://phabricator.wikimedia.org/T231998 (10SBisson) The PR has been merged. Can we update the library to the latest on master or do we need an official release from them? [15:36:06] hey team! [15:39:03] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" 
[analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/549054 (https://phabricator.wikimedia.org/T214493) (owner: 10Awight) [15:43:21] joal: and elukey, would love to get HDFSCleaner changes in for train today [15:43:24] review for joal: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/548909 [15:43:33] review for elukey: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/549092 [15:45:08] (03CR) 10Elukey: [C: 03+1] "LGTM, can also be done in puppet, but here is fine as well (the only thing that might be missing is the /usr/lib/hadoop stuff)." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/549092 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:45:47] (03CR) 10Ottomata: "> Patch Set 1: Code-Review+1" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/549092 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:45:58] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/547713 (owner: 10Awight) [15:46:17] (03CR) 10Elukey: [C: 03+1] "Yep all good from my side, just wanted to mention it :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/549092 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:46:28] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add bin/hdfs-cleaner wrapper script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/549092 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:46:31] ty [15:49:53] 10Analytics, 10New-Readers: Add KaiOS to the list of OS query options for pageviews in Turnilo - https://phabricator.wikimedia.org/T231998 (10Nuria) Once we have a release we can update packages on our end. [16:01:39] ottomata: historic moment, I am ready to deploy keytabs to all the nodes [16:01:42] \o/ [16:02:40] oh boy [16:02:43] go 4 it! 
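The bin/hdfs-cleaner wrapper merged above exists because systemd's ExecStart performs no shell expansion, so `$(/usr/bin/hadoop classpath)` is never substituted inside a unit file. A minimal sketch of such a wrapper; the java main class and paths are illustrative, not the actual refinery script:

```shell
# Minimal sketch of a wrapper for a systemd unit: ExecStart does no shell
# expansion, so the Hadoop classpath must be resolved inside a script.
# The java main class and install paths are illustrative, not the real ones.
cat > /tmp/hdfs-cleaner <<'EOF'
#!/bin/bash
set -e
# Resolve the Hadoop classpath at run time, then exec the actual job.
CLASSPATH="$(/usr/bin/hadoop classpath)"
exec java -cp "$CLASSPATH" org.example.HDFSCleaner "$@"
EOF
chmod +x /tmp/hdfs-cleaner
bash -n /tmp/hdfs-cleaner && echo "wrapper syntax OK"
# A unit file would then use something like: ExecStart=/usr/local/bin/hdfs-cleaner
```

The same pattern applies to any unit that needs command substitution: do the expansion in the wrapper, point ExecStart at the wrapper.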
[16:11:38] (03CR) 10Mforns: Add python script to generate intervals for long backfilling (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/547750 (https://phabricator.wikimedia.org/T237119) (owner: 10Fdans) [16:12:13] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 6 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (10Pchelolo) I'm not opposed to having `stream_name` and `stream_pattern`, but > Here, I'm mutating the orig... [16:17:58] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Operations, and 4 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (10Ottomata) Ok @ema @akosiaris patches for schema.discovery.wmnet ^. Not sure if I need this but I think it would be good fo... [16:32:29] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 6 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (10Ottomata) > If the client depends on having the original stream name in stream property inside the stream c... [16:43:07] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Prepare the Hadoop Analytics cluster for Kerberos - https://phabricator.wikimedia.org/T237269 (10elukey) [16:52:42] 10Analytics, 10GLOW: Publish Superset dashboard - https://phabricator.wikimedia.org/T237549 (10Iflorez) [17:00:39] ottomata: will review your patch just after standup [17:01:04] (03PS1) 10Mforns: Add Spark job to update data quality table with incoming data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549115 (https://phabricator.wikimedia.org/T235486) [17:26:26] ottomata: wanna batcave? [17:26:29] quickly [17:26:44] 10Analytics, 10GLOW: Publish Superset dashboard - https://phabricator.wikimedia.org/T237549 (10Nuria) {F31027217} All superset users have access to the dashboard as is. 
I see it on the dashboard list. [17:27:12] probably out for lunch, will check laterz :) [17:27:19] (going afk a bit as well) [17:28:12] elukey [17:28:17] have event platform meeting now [17:28:21] let's do later [17:32:29] even tomorrow! [17:35:41] nuria: Is there any x-cache cp that I should avoid? Yesterday I filtered by entries that had cp3033 in x-cache, but that was not very active for upload.wikimedia.org . [17:35:53] joal: review https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/548909 so we can get in train today? [17:36:10] yessir, on it [17:36:13] ty [17:36:23] 10Analytics, 10GLOW: Publish Superset dashboard - https://phabricator.wikimedia.org/T237549 (10Iflorez) I see, thank you for clarifying that. I wasn't sure if the dashboard was available beyond my own view. When I log in and open the dashboard I see "DRAFT" next to the title. When I click on the draft button... [17:40:23] Is it possible to get custom data dumps? I'd like to make a graph network out of the encyclopedia internal links but pagelinks mixes numeric IDs and article names. [17:43:18] lexnasser: you can see what caches are active at https://config-master.wikimedia.org/pybal [17:43:19] e.g. [17:43:23] https://config-master.wikimedia.org/pybal/eqiad/upload [17:43:32] https://config-master.wikimedia.org/pybal/esams/upload [17:43:46] cp3033 is not listed as an upload cache there [17:44:20] ottomata: I’ll check those out, thanks! 
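ottomata's config-master pointer above can be scripted to check whether a given cache host is an active backend. A hedged sketch using only the URLs given in the log; the grep pattern is just the hostname under discussion:

```shell
# Hedged sketch: check whether a cache host (cp3033, from the discussion above)
# is listed as an active backend for the upload cluster at each site.
for site in eqiad esams; do
  echo "== ${site}/upload =="
  curl -s "https://config-master.wikimedia.org/pybal/${site}/upload" | grep 'cp3033' \
    || echo "cp3033 not listed"
done
```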
[17:46:32] a-team: I'm done with techcom in 15 minutes, and I will run the deployment train according to the etherpad/kanban, do let me know if there are any exceptions [17:46:42] (03CR) 10Joal: "One nit, but overall looks good" (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/548909 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [17:46:59] milimetric: We should be done with ottomata for the patch he wants in [17:47:06] 10Analytics, 10GLOW: Publish Superset dashboard - https://phabricator.wikimedia.org/T237549 (10Nuria) >A related question, is it possible to share this dashboard with GLOW team members that are not on Superset? Well, to see a dashboard on superset you need access to superset unless you are sharing your screen.... [17:47:10] ottomata: --^ [17:47:39] joal: ok, I'll wait for yall to merge that then (548909, right?) [17:47:48] yessir :) [17:47:51] thanks milimetric [17:50:00] (03PS4) 10Ottomata: HDFSCleaner improvements [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/548909 (https://phabricator.wikimedia.org/T235200) [17:50:06] (03CR) 10Ottomata: HDFSCleaner improvements (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/548909 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [17:50:15] ty [17:50:20] will merge after tests pass [17:50:36] np ottomata - thank you :) [17:53:46] 10Analytics, 10GLOW: Publish Superset dashboard - https://phabricator.wikimedia.org/T237549 (10Iflorez) 05Open→03Resolved [17:54:15] 10Analytics, 10GLOW: Publish Superset dashboard - https://phabricator.wikimedia.org/T237549 (10Iflorez) Thank you [18:01:43] (03CR) 10Ottomata: [C: 03+2] HDFSCleaner improvements [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/548909 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [18:01:54] merged milimetric ty for waiting [18:02:04] lemme know when train has landed [18:10:32] joal: o/ [18:10:37] did we restart the 
failed job? [18:10:44] elukey: oh, we didn't! [18:10:49] elukey: shall I do so? [18:11:25] elukey: thanks a lot for thinking of that! I had to talk about it at standup and forgot :( [18:11:37] yep please restart :) [18:12:23] sudo systemctl start name-of-the-timer.service ? [18:12:27] elukey: --^ [18:12:28] yep! [18:12:33] ack! [18:13:33] !log restart refinery-import-page-current-dumps.service to test after yesterday's failure [18:13:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:14:02] elukey: restarted, tailing logs [18:14:11] super [18:14:14] excuse me for having forgotten [18:14:47] nono I did that too, we had other stuff to think about :) [18:20:38] RECOVERY - Check the last execution of refinery-import-page-current-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-page-current-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:23:08] \o/ [18:24:04] (03PS2) 10Mforns: Add Spark job to update data quality table with incoming data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549115 (https://phabricator.wikimedia.org/T235486) [18:25:24] elukey: must have been bad luck on file writing [18:35:22] elukey: wanna chat? [18:35:51] (03PS3) 10Mforns: Add Spark job to update data quality table with incoming data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549115 (https://phabricator.wikimedia.org/T235486) [18:36:31] ottomata: sure [18:36:37] k bc [18:50:53] * elukey off!
[18:56:58] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Operations, and 4 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (10Ottomata) a:03Ottomata [18:57:09] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 5 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (10Ottomata) [19:06:55] 10Analytics, 10Tool-Pageviews: Topviews Analysis of the Hungarian Wikipedia is flooded with spam - https://phabricator.wikimedia.org/T237282 (10Nuria) This is a bot, see patterns that are symmetric per UA (just looked at Orális_szex page) +---+-------------+-------------------+--------------------------------... [19:07:12] ottomata: Heya - I'm inviting you to a meeting tomorrow [19:07:23] 10Analytics, 10Tool-Pageviews: Topviews Analysis of the Hungarian Wikipedia is flooded with spam - https://phabricator.wikimedia.org/T237282 (10Nuria) This is a bot, see patterns that are symmetric per UA (just looked at Orális_szex page) +---+-------------+-------------------+--------------------------------... [19:07:32] ottomata: meeting is with Will Doran, on how analytics infra could help on counting revisions [19:07:57] ottomata: I think it'll be early for you, let me know if not ok [19:08:15] ok! [19:08:17] 10Analytics, 10Research-Backlog: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (10Nuria) [19:08:20] 10Analytics, 10Tool-Pageviews: Topviews Analysis of the Hungarian Wikipedia is flooded with spam - https://phabricator.wikimedia.org/T237282 (10Nuria) [19:08:43] joal: maybe too early we also have meeting with djellel and martin ya? [19:08:54] oh ya i probably won'tm ake that one with will [19:08:55] ottomata: yep [19:09:22] hm - Shall I ask him to move to Friday, same time as we have the meeting with Martin and Djellel? [19:10:55] Like 9:30am on Friday morning? 
ottomata: -^ [19:11:26] joal this friday no good for me! [19:11:30] i think go ahead and do without me [19:11:38] you probably know all and more than I do anyway :) [19:11:46] huhu [19:11:51] ok ottomata :) [19:14:02] nos: custom data dumps aren't really our thing, but maybe you can query the mediawiki API for the data you need? I think you can get the page id or page title given one or the other (I think?) [19:14:54] (03PS3) 10Mforns: Refactor data_quality oozie bundle to fix too many partitions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/547320 (https://phabricator.wikimedia.org/T235486) [19:15:41] (03PS4) 10Mforns: Refactor data_quality oozie bundle to fix too many partitions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/547320 (https://phabricator.wikimedia.org/T235486) [19:16:40] (03PS5) 10Mforns: Refactor data_quality oozie bundle to fix too many partitions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/547320 (https://phabricator.wikimedia.org/T235486) [19:17:18] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Figure out how to $ref common schema across schema repositories - https://phabricator.wikimedia.org/T233432 (10Ottomata) Moving this ticket to Done. We have a plan. We can continue bikeshed about repo names in {T20... [19:18:03] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (10Ottomata) [19:21:03] 10Analytics, 10Tool-Pageviews: Topviews Analysis of the Hungarian Wikipedia is flooded with spam - https://phabricator.wikimedia.org/T237282 (10Nuria) Numbers above are just for 1 hour. [19:23:26] 10Analytics, 10Tool-Pageviews: Topviews Analysis of the Hungarian Wikipedia is flooded with spam - https://phabricator.wikimedia.org/T237282 (10Nuria) Also if you look at the pageviews from this IP from 1 day these are the titles requested.
+-----+----------------------+ |c |page_title | +-----+... [19:23:53] joal: just looked at daily numbers for this hungarian bot [19:24:02] joal: (i was looking hourly before) [19:24:20] And? [19:24:23] nuria: --^ [19:24:26] joal: and daily requests per (ip, ua) are [19:24:49] joal: 25991 in the *sex page [19:25:05] joal: overall [19:25:19] joal: sorry above number is overall [19:25:35] nuria: could we have the number of distinct UAs associated? [19:25:45] joal: one sec and i can tell you that one, just recalculated [19:26:17] nuria: This would give us an average nbPageviews / UA for a single page probably way too high [19:28:08] joal: ya, so i think algorithm as is would catch these (still getting numbers to verify) [19:30:48] 10Analytics, 10Analytics-EventLogging, 10QuickSurveys, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (10Isaac) Thanks @Nuria ! > if your survey is happening all in desktop The survey happens on bo... [19:39:05] nuria: re: the script, I see now what you meant with getting it off the if block, you want the tests to be run with each execution to check that everything's fine before launching the oozie job [19:40:12] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 6 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (10Ottomata) Or also, I guess there's no need to mutate the `stream` value at all, even if I always return an...
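The per-(ip, user_agent) counting nuria and joal run above — requests per pair, flagging pairs that are implausibly high for a single page — can be illustrated outside Hive. A toy sketch over synthetic CSV data; the real analysis runs against the webrequest/pageview data in Hadoop:

```shell
# Toy sketch of counting requests per (ip, user_agent) pair, the bot-detection
# signal discussed above. The data below is synthetic, not from the cluster.
cat > /tmp/requests.csv <<'EOF'
1.2.3.4,UA-1,Oralis_szex
1.2.3.4,UA-1,Oralis_szex
1.2.3.4,UA-1,Oralis_szex
5.6.7.8,UA-2,Main_Page
EOF
# Count rows per (ip, UA) pair; a very large count on one page suggests a bot.
awk -F',' '{ c[$1 "," $2]++ } END { for (k in c) print c[k], k }' /tmp/requests.csv | sort -rn
# first line of output: "3 1.2.3.4,UA-1"
```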
[19:50:47] 10Analytics, 10ChangeProp, 10Event-Platform, 10Core Platform Team (Needs Cleaning - Security, stability, performance, and scalability (TEC1)): Enable controlled debug logging for change-prop - https://phabricator.wikimedia.org/T189621 (10CCicalese_WMF) 05Open→03Declined [19:50:50] 10Analytics, 10Event-Platform, 10Services (done), 10User-Elukey: Kafka sometimes misses to rebalance topics properly - https://phabricator.wikimedia.org/T179684 (10CCicalese_WMF) [19:54:06] (03CR) 10Joal: "Adding my bunch of comments as well" (035 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/547750 (https://phabricator.wikimedia.org/T237119) (owner: 10Fdans) [19:58:35] fdans: can talk in a bit, in a meeting [20:05:42] joal: 3945 per (ip, UA) pair on 10/16 so ya, within bounds of algo (can talk in a bit) [20:13:59] Interesting! A stuck pageview job ... [20:14:13] milimetric: shall I restart it or do you want to do it? [20:14:48] joal: I am back and was about to finally do the train [20:14:53] I can restart it, no prob [20:15:43] milimetric: it is in "running" mode in hue, but it takes too long (whitelist-check runs since 14:38:21...) [20:16:02] woah, that's forever [20:16:08] Let's force kill and rerun please milimetric (i can do it, you have enough with the train) [20:16:19] uh, if you're looking at it, restart it, and that way I can look again later if it's still in trouble [20:16:28] ack milimetric :) [20:16:31] thx!
[20:16:57] !log Kill-rerun pageview-hourly-wf-2019-11-6-13 for being stuck in whitelist-check [20:16:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:17:39] nuria: I'm gone for dinner, let's talk tomorrow if ok for you [20:18:26] fdans: i was not thinking so much of tests running every time (which they can) but rather not having any "special code for testing" [20:18:29] ok, choo choo, refinery only, no refinery-source changes on etherpad [20:18:54] milimetric: Andrew's patch is on source [20:19:12] milimetric: I let you sync with him [20:19:39] ottomata: that needs to be on etherpad, then [20:20:01] ohh [20:20:10] what is the etherpad? [20:20:21] milimetric: ? [20:20:58] https://etherpad.wikimedia.org/p/analytics-weekly-train should have a mention of everything we deploy that week [20:21:19] I'm not sure about https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/549092/ [20:21:31] and I'm adding this to the etherpad now: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/548909/ [20:22:08] ok added milimetric [20:22:17] k [20:22:28] thanks [20:22:50] (03PS6) 10Mforns: Refactor data_quality oozie bundle to fix too many partitions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/547320 (https://phabricator.wikimedia.org/T235486) [20:22:57] I actually don't know much about the hdfs cleaner, so that has to be restarted after the deploy ottomata? [20:23:43] I see it's this: https://phabricator.wikimedia.org/T235200, I put it in ready to deploy [20:24:31] milimetric: i'll fix once you deploy [20:24:35] i have to do a puppet change [20:24:41] oh! ok [20:24:51] makes sense [20:26:43] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 6 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (10Pchelolo) mmm... if we convert the array into a key-value map, we lose the ordering of the array. Thus, we...
[20:29:29] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 6 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (10Ottomata) Hm, yeah. I guess eventgate should then be modified to work with an array and iterate through to... [20:29:40] (03PS7) 10Mforns: Refactor data_quality oozie bundle to fix too many partitions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/547320 (https://phabricator.wikimedia.org/T235486) [20:31:26] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 6 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (10Ottomata) BTW, notice I'm talking about the return result of EventStreamConfig lookup, not the array in Med... [20:33:20] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 6 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (10Pchelolo) yeah, I understand. I mean, once in eventuate you have the map, you'd need pretty much the same c... [20:41:14] nuria: okeeeei putting it in a test module inside the file [21:07:00] fdans: k! [21:08:58] 10Analytics: Request for a large request data set for caching research and tuning - https://phabricator.wikimedia.org/T225538 (10lexnasser) Hi @Danielsberger , I'm almost finished compiling the data. This is what the dataset would look like: NOTE THAT THIS IS **FAKE** EXAMPLE DATA . |*Notes*|relative_unix|hash... 
[21:21:15] !log refinery deployed, hdfs cleaner and tls ready to be restarted [21:21:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:21:38] !log restarting webrequest load bundle and druid loading jobs [21:21:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:21:47] (there will be manual steps in there, so this might take a bit) [21:31:16] ottomata: I deployed refinery, so you can do your hdfs cleaner thing [21:31:22] I'm having some trouble [21:31:37] with altering wmf_raw.webrequest, because of JsonSerDe not found etc. [21:31:52] I'm always forgetting what's going on there, it's so weird [21:31:57] ok, thank you! [21:32:10] milimetric: did you add the hcatalog jar? [21:32:28] yeah, I did the ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar; [21:32:36] hm, and JsonSerDe not found? [21:32:38] hmm [21:32:38] and I can't even desc webrequest, it gives me that error [21:32:41] maybe this is a beeline thing. [21:32:47] are you using beeline or hive cli? [21:32:52] hive (default)> desc wmf_raw.webrequest; [21:32:52] FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found [21:32:55] hive [21:33:25] hmmm [21:33:28] it worked for me.... [21:34:13] milimetric: from what server? [21:34:18] stat1004 [21:34:22] me too [21:34:22] sudo -u analytics hive [21:34:27] ok trying that [21:35:01] heh, milimetric it is 100% fine for me [21:35:07] wtffffff [21:35:41] this literally happens to me every time, and when I tell you guys at some point you're like "oh you didn't do XYZ" and then it works and we're all upset [21:36:08] (only with that table on wmf_raw, nothing else) [21:36:25] usually what you didn't do is ADD JAR [21:36:28] you sure you ADD JAR? [21:36:56] I'm remembering something from last time!!!
[21:37:03] it was something suuuper weird about timing or something [21:37:09] yes, I definitely added jar [21:37:38] heh, works now [21:37:40] ??? [21:37:49] yeah, I remember we got to the bottom of this last time [21:37:55] what was it? [21:37:58] it was some insane thing about timing or something and you all said I should just ignore it [21:38:04] ? [21:38:10] NOTHING!!! I just killed hive-cli and went back in and it worke [21:38:12] d [21:38:12] you all == me too? [21:38:15] ohh k [21:38:18] weird [21:38:28] yeah, you and jo I think [21:42:39] milimetric, qq about spark sql (or sql in general), don't need to respond now if you're busy with train: If I do SELECT blah1 UNION ALL SELECT blah2, will blah1 and blah2 keep the ordering in subsequent queries? [21:42:53] um, I think we have some kind of a problem with the TLS addition [21:43:11] I can't get wmf_raw.webrequest to return anything, anyone on a-team familiar besides jo? [21:43:23] hm [21:43:24] I did alter table add columns [21:43:34] and desc/create statement look good [21:43:35] milimetric: hat are you selecting? [21:43:36] but now: [21:43:49] milimetric, did you add fields inside a struct? [21:43:57] hive (default)> select * from wmf_raw.webrequest where year=2019 and month=10 and day=15 and hour=13 limit 10; [21:43:57] OK [21:43:57] hostname sequence dt time_firstbyte ip cache_status http_status response_size http_method uri_host uri_path uri_query content_type referer x_forwarded_for user_agent accept_language x_analytics range x_cache accept tls webrequest_source year month day hour [21:43:57] Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating hostname [21:43:57] Time taken: 0.064 seconds [21:44:11] hostname... 
[21:44:12] weird [21:44:19] that's because I say * [21:44:19] hmm i think select * is not your friend here [21:44:26] if I say select tls it says error to evaluate tls [21:44:35] what about tls on recent data [21:44:39] same [21:44:53] milimetric, did you add fields inside a struct? [21:45:02] no, just added a tls string column [21:45:05] ok [21:45:30] alter table wmf_raw.webrequest add columns (tls string COMMENT 'TLS information of request'); [21:45:55] hm [21:47:48] milimetric: select * will not work [21:47:56] yeah but select hostname [21:47:57] should [21:47:58] and it isn't [21:47:59] milimetric: but select [21:48:22] first, select * should definitely work, something's wrong with that table if it doesn't [21:48:34] second, yeah, I can do select x, y, z, and none of it works, old or new columns [21:48:36] milimetric: no, select * does not work for column additions [21:48:58] right, but it should - or we're not updating the metadata properly [21:49:05] milimetric: but you are right select should work [21:49:09] right [21:49:31] have we updated the underlying data? [21:49:43] I think the underlying data just has a new field as of recently [21:49:46] what if hive looks for the field and can not find it? [21:49:52] milimetric: one sec [21:51:33] milimetric: the table now [21:51:37] https://www.irccloud.com/pastebin/sFGMwkjZ/ [21:51:46] cc ottomata [21:51:49] ??? [21:51:55] when I do that it shows me a good create, watch: [21:51:58] does show has having no columns [21:52:15] https://www.irccloud.com/pastebin/Peb8rn1o/ [21:52:38] hm, is it because I did this as the analytics user? 
[21:53:08] 10Analytics, 10Scoring-platform-team: Investigate formal test framework for Oozie jobs - https://phabricator.wikimedia.org/T213496 (10Halfak)
[21:53:11] 10Analytics, 10Dumps-Generation, 10ORES, 10Scoring-platform-team: Produce dump files for ORES scores - https://phabricator.wikimedia.org/T209739 (10Halfak)
[21:53:53] 10Analytics, 10Scoring-platform-team (Research): Investigate formal test framework for Oozie jobs - https://phabricator.wikimedia.org/T213496 (10Halfak)
[21:53:56] oh no, nuria you need this:
[21:53:56] ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
[21:54:36] very oddly, if you don't do that, it shows you a broken create table statement, I guess the column metadata itself is stored in a way that needs that jar to access, and if it's missing it just swallows some error and shows you part of the metadata?!!
[21:54:41] so broken
[21:56:30] I looked at the raw data uncompressed, in json format, and indeed on 26/10/2019 it didn't have the tls field and today it does have the tls field
[21:56:53] but that in my understanding should in no way change how the table queries...
[21:57:04] milimetric: it is a permits error
[21:57:50] milimetric: i see a permits error if you are [yourself], but being analytics gives an error with "java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating hostname"
[21:58:03] is that what you see cc ottomata ?
[21:58:16] cause the 1st thing i would try is to drop the tls column
[21:58:18] nuria: right, but that's because that's the private directory and we don't have access to it
[21:58:46] milimetric: i see it too
[21:58:48] I can try dropping it, seems fine, but I'm gonna guess we'll be in the same situation
[21:58:56] hang on
[21:59:06] nuria: dan and I are running hive as analytics user
[21:59:16] batcave?
[21:59:17] ottomata: ya, i saw that and did same
[21:59:21] 1. it shouldn't be private, that is a mistake.
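The ADD JAR pointed out at 21:53:56 matters because the table's storage format relies on the hive-hcatalog JSON SerDe; as described above, without that jar on the hive-cli classpath even the rendered create-table statement comes out broken. A minimal session sketch of the workaround:

```sql
-- Without this jar, hive-cli cannot load the JSON SerDe class the table
-- references, and (per the discussion above) metadata-rendering commands
-- can swallow the error and show only partial column metadata.
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

-- With the jar loaded, the full definition should render correctly:
SHOW CREATE TABLE wmf_raw.webrequest;
```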
the dirs are group owned by analytics-privatedata-users
[21:59:29] we should be able to access not as analytics
[21:59:35] something is writing the data with wrong group perms
[21:59:41] they are supposed to be inherited from dir
[21:59:43] so we should check that
[21:59:47] ottomata: but that is another problem
[21:59:50] yes
[21:59:54] checking something hang on
[22:00:03] ottomata: k
[22:01:55] ok
[22:02:08] the alter will only apply to newly created partitions after the alter happens
[22:02:18] i have a test table
[22:02:21] did the same thing dan did
[22:02:24] got same error
[22:02:27] then i added a new partition
[22:02:32] and i could select from that partition
[22:02:39] but not the one from before the alter
[22:03:16] hm, that's tricky
[22:03:38] we have daily jobs that need to select from across this update
[22:03:46] (before and after)
[22:04:03] well, this is for the raw data
[22:04:07] i believe for parquet it will be different
[22:04:23] the main problem now is that we've only loaded up to hour 19 in wmf.webrequest
[22:04:24] so
[22:04:28] ottomata: that the alter applies to newer partitions makes total sense
[22:04:30] hm... can I drop the old partitions and add new partitions?
[22:04:34] yes
[22:04:42] ottomata: we have done this before, that is why you cannot select *
[22:04:49] that is for parquet
[22:05:04] or selecting across partitions with different underlying schemas
[22:05:11] select * works within a single partition
[22:05:38] but wouldn't select hostname work on old partitions?
[22:05:41] ottomata: right, that makes sense
[22:05:58] milimetric: it should, there is something else going on, i think
[22:06:08] i think it should too, something is going on with the json serde, don't know
[22:06:09] but
[22:06:22] i believe we can drop and recreate hour=20 for webrequest raw
[22:06:26] and we'll have no impact
[22:06:33] ... hm, ok, so for now you think maybe drop old partitions from today, november 6, add them back, and then try querying?
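The fix ottomata describes — the ALTER only binds to partitions registered after it runs, so pre-alter partitions keep the old column list — amounts to dropping and re-adding the affected raw partition (hour=20 here). A sketch of the commands; the partition spec matches the table's partition columns shown in the earlier output, but the LOCATION path is hypothetical, and in practice it is whatever path the partition already points at:

```sql
-- Dropping a partition on an EXTERNAL table removes only the metastore
-- entry; the underlying files in HDFS stay in place.
ALTER TABLE wmf_raw.webrequest DROP IF EXISTS
  PARTITION (webrequest_source = 'text', year = 2019, month = 11, day = 6, hour = 20);

-- Re-adding it registers the partition against the current (post-alter)
-- table schema, so the new tls column becomes readable.
ALTER TABLE wmf_raw.webrequest ADD
  PARTITION (webrequest_source = 'text', year = 2019, month = 11, day = 6, hour = 20)
  LOCATION '/wmf/data/raw/webrequest/...';  -- hypothetical: reuse the partition's existing path
```

Depending on the Hive version, `ALTER TABLE ... ADD COLUMNS ... CASCADE` can propagate a column addition to existing partition metadata in one step, which would avoid the drop/re-add dance; whether that applies to this cluster's Hive is not established in the log.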
[22:06:34] milimetric: i've got some commands prepped to do it
[22:06:35] shall i?
[22:06:41] just hour 20 is needed
[22:06:46] hour=19 is already in refined table
[22:07:06] coming to batcave to pair
[22:07:09] ottomata: right, but we need to be able to query the whole day... oh, from refined
[22:07:33] milimetric: no, refine queries per hour
[22:36:51] !log successfully restarted webrequest bundle, webrequest druid daily, and webrequest druid hourly
[22:36:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:37:07] I checked and the daily query will work, it just returns nulls for when the map is not there
[22:37:28] and I will monitor to make sure the bundle works now, given the partition/access weirdness
[22:37:31] seems ok so far
[22:37:49] docs for this and other stuff are on my todo, I will get to them
[22:40:13] there are some oozie alerts for data-quality metrics, I'll look into those, but the other ones are all me restarting stuff
[22:42:03] (03CR) 10Nuria: "Looks good, couple nits and leaving to joseph to review spark code" (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549115 (https://phabricator.wikimedia.org/T235486) (owner: 10Mforns)
[22:48:29] (03CR) 10Mforns: [C: 04-2] "Still testing." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/547320 (https://phabricator.wikimedia.org/T235486) (owner: 10Mforns)
[22:48:42] (03CR) 10Mforns: [C: 04-2] "Still testing." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/549115 (https://phabricator.wikimedia.org/T235486) (owner: 10Mforns)
[23:00:41] RECOVERY - Check the last execution of hdfs-cleaner on an-coord1001 is OK: OK: Status of the systemd unit hdfs-cleaner https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:05:11] ^ cooo
[23:05:15] bye byeee