[00:44:23] 10Analytics: Create table in hive with a continent lookup for countries - https://phabricator.wikimedia.org/T127995 (10Nuria) @Chtnnh sorry, the label here was a mistake [02:07:05] PROBLEM - Check the last execution of archive-maxmind-geoip-database on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:18:17] RECOVERY - Check the last execution of archive-maxmind-geoip-database on stat1007 is OK: OK: Status of the systemd unit archive-maxmind-geoip-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:32:18] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-extensions-CollaborationKit, 10Multimedia: Decide on JSON validation library - https://phabricator.wikimedia.org/T147137 (10Krinkle) [07:31:17] hello everybody, I am going to the dentist now, will be online in ~2h more or less [08:16:12] 10Analytics, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 6 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10fgiunchedi) [08:47:06] FYI, there's an Icinga warning for "HDFS corrupt blocks" on an-master1001 [08:47:40] Hi moritzm - Thanks for the ping - Let's wait for elukey :) [09:06:24] Hi, I'm writing a hive query in which a tiny (a few thousand rows selected) eventlogging table needs to be joined to mediawiki_revision_create, three times to get metadata about various revisions. For some reason the HQL is eluding me, I haven't found a nice way to structure this. Does anyone have an example in mind, which I can study? [09:07:38] maybe this is more of a mysql job, since the rev_id fields are more efficient to join there? [09:08:55] on the other hand, I'm quite happy to query multiple wiki dbs through one table rather than split this up by host. [09:26:43] awight: Hi - I wonder about the 3-times join - Wouldn't a single join be sufficient? - but this is rather a detail [09:28:04] joal: That might be where I'm going wrong. The event contains 2 revision IDs, and I want to join m_r_c.rev_id on those two IDs, then I want a third row where m_r_c.rev_parent_id matches one of the IDs. [09:28:06] awight: Also, I suggest using Spark instead of Hive for small-ish tables (revision-create is not very big if ou take only a small portion of time) [09:28:16] also, an-worker1083 is down, I can't even connect to the serial console [09:28:43] ack moritzm :S [09:29:58] joal: This is for a one-off report rather than a periodic job. Spark still makes sense for that use case? [09:30:12] awight: spark is easier than hive IMO [09:30:12] :) [09:31:42] awight: on statX, use: spark2-shell --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 16G --conf spark.dynamicAllocation.maxExecutors=64 [09:31:56] rad, thank you [09:31:59] awight: if ou prefer python replace spark2-shell by pyspark2 [09:32:40] awight: then when thing has started, you can do: spark.sql(""" My sql with carriage returns""").show() [09:33:12] awight: small overhead compare to hive, and usually a lot faster for relatively small queries [09:33:47] I like the sounds of that. At the end of the end, TSV output is all I need. [09:33:49] awight: about your request - Do you have a time-limit about the various revisions creation [09:34:22] awight: spark.sql("blah").write.format("tsv").save("path") [09:34:24] Unfortunately not. 
I'm happy to consider only 2019 and 2020, but that doesn't narrow it down by much. [09:34:32] ri [09:34:50] awight: very recent or not really? [09:35:11] meaning, revision_create or mediawiki_history? [09:35:21] also, which metadata are ou interested in? [09:35:25] awight: --^ [09:35:50] * joal is trying to provide a not-too-bad advise [09:36:33] tl;dr, I'm surveying edit conflicts from Feb 2020, but the base revision could be a few months or years old. It's okay to disregard some random-ish subset with older base revisions. Metadata from either of those tables is fine for my purposes. Just general stuff about the author, UA, edit summary... [09:39:34] awight: the I'd do it: define 2 dataset containing revisions for rev_ids in set 1 and 2 - Those should be small-ish, therefore you can cache them in spark [09:40:14] awight: Then extract from dataset 1 the Ids to generate dataset 3, and get it (with cache as well) [09:40:26] Finally join those 3 datasets and write [09:41:25] joal: I think I see what you mean, it makes sense. Thanks! I might drop a paste link for sanity checking, shortly. [09:41:36] ack awight :) [09:43:21] back! [09:45:09] Hi elukey :) [09:45:29] I hope you feel good - some ops awaiting :S [09:45:52] dentist on Monday morning is not a great start of the week :D [09:46:32] so an-worker1083 is down, it is probably the cause of the corrupt blocks alarm [09:48:03] ack, I didn't open a task for 1083 yet, wasn't sure whether it was WIP of some sort [09:48:04] makes sense - [09:48:08] can't even connect via mgmt [09:48:41] moritzm: I am un mgmt, there seems to be reports of CPU soft lock ups [09:48:48] I just issued a powercycle [09:49:11] ah, good. earlier the morning all connections to the serial even timed out [09:51:07] thanks for checking! [09:51:41] stat1007 is also incredibly overloaded [09:51:41] sigh [09:52:09] dsaez: good morning! Are you there by any chance? [09:52:53] moritzm: I'd need to evolve the systemd per user slice memory limits to something related to the whole host, the current limits don't work :( [09:53:16] something like a slice but for all users, or possibly something that periodically collects processes and adds them to a cgroup with limits [09:58:02] joal: replication for the host down started a while ago, still in progress https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&fullscreen&panelId=41 [09:58:11] but it sounds in recovery [10:01:31] super elukey - Thanks a lot [10:03:57] also joal I didn't solve the problem between oozie and hive in bigtop, the hue idea was of course not right. For some reason that I don't get, when oozie issues a hive2 action to the hive server 2, there is some problem when connecting to the metastore, namely lack of Kerberos credentials [10:04:18] weird! [10:04:22] I'll do more tests today, but so far it is only oozie that shows this behavior [10:04:52] I filed https://issues.apache.org/jira/browse/BIGTOP-3317 for the sharedlib thing, my patch was merged [10:04:55] elukey: could be a change in how creds need to be passed - I'll study oozie doc this afternoon (currently code-reviweing) [10:05:01] but IIUC it will not be backported to 1.4 [10:05:25] joal: ah yes sure, didn't mean to disturb sorry! 
Just wanted to update you :) [10:05:28] let's sync later on [10:05:30] np elukey :) [10:05:35] feel free to drop me to /dev/null [10:14:22] PROBLEM - Check the last execution of wikimedia-discovery-golden on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:14:52] I never do that, I just sometimes do some buffering elukey :) [10:15:10] * joal gone for ~1h - back in a bit [10:16:20] PROBLEM - Check the last execution of archive-maxmind-geoip-database on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:25:36] RECOVERY - Check the last execution of wikimedia-discovery-golden on stat1007 is OK: OK: Status of the systemd unit wikimedia-discovery-golden https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:27:30] RECOVERY - Check the last execution of archive-maxmind-geoip-database on stat1007 is OK: OK: Status of the systemd unit archive-maxmind-geoip-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:37:58] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10MoritzMuehlenhoff) Looks good to me, what does the "Set the above two groups as admin groups for the stat100x roles." refer to? There are three groups... [10:39:00] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) >>! In T246578#5932003, @MoritzMuehlenhoff wrote: > Looks good to me, what does the "Set the above two groups as admin groups for the stat100x... [10:39:15] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) [11:05:06] 10Analytics, 10Analytics-Kanban: Unify puppet roles for stat and notebook hosts - https://phabricator.wikimedia.org/T243934 (10nshahquinn-wmf) >>! In T243934#5906856, @Ottomata wrote: > Then we'd just have: > > - `analytics-users` - all stat boxes + mysql analytics dbs > - `analytics-privatedata-users` - all... [11:09:49] joal: hellooo do you have a couple of mins in the batcave when you're back? :) [11:10:42] 10Analytics, 10Analytics-Kanban: Unify puppet roles for stat and notebook hosts - https://phabricator.wikimedia.org/T243934 (10elukey) >>! In T243934#5932167, @nshahquinn-wmf wrote: >>>! In T243934#5906856, @Ottomata wrote: >> Then we'd just have: >> >> - `analytics-users` - all stat boxes + mysql analytics d... [11:30:19] 10Analytics: kinit "Failed to store credentials" error - https://phabricator.wikimedia.org/T246151 (10dr0ptp4kt) It works again, thank you. [12:17:02] * elukey lunch! [12:51:01] heya fdans - sorry I missed the ping [12:51:09] fdans: I'm here when you want :) [12:51:14] joal: helloo! [12:51:18] joal: now? [12:51:21] sure [13:02:33] fdans: back I am! [13:02:49] joal: let's batcave! [13:02:56] I'm in :) [13:18:03] going to restart the hadoop masters for openjdk upgrades [13:18:23] ack elukey - Currently working with fdans on something, no big deal to lose it :) [13:18:46] joal: do you prefer me to wait? In theory nothing should stop though [13:18:53] nope - all good [13:19:03] ack [13:30:41] masters restarted! [13:31:27] joal: No rush or obligation. 
But I'm pretty sure I did this in the worst possible way, and I don't see how to elegantly join multiple times: https://gitlab.com/adamwight/conflict-query/-/blob/master/src/main/scala/ConflictApp.scala#L42-45 [13:32:19] will look in a bit awight [13:51:08] (line numbers no longer apply) [13:51:28] ack awight [14:01:58] joal wdqs meeting? [14:02:04] YES ! [14:08:23] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10nshahquinn-wmf) Ah, sorry for commenting on an outdated version! >>! In T243934#5932193, @elukey wrote: >>>! In T243934#5932167, @nshahquinn-wmf wrote... [14:13:09] (03CR) 10Abijeet Patro: [V: 03+2] "recheck" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576042 (owner: 10L10n-bot) [14:13:12] o/ [14:16:48] (03CR) 10Abijeet Patro: "recheck" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576042 (owner: 10L10n-bot) [14:19:01] (03CR) 10Abijeet Patro: "Needs CR+2" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576042 (owner: 10L10n-bot) [14:20:23] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) >>! In T246578#5932587, @nshahquinn-wmf wrote: > Ah, sorry for commenting on an outdated version! > >>>! In T243934#5932193, @elukey wrote: >>... [14:23:33] so on BigTop it seems that oozie doesn't use beeline for hive2 actions [14:23:44] I just realized that in the logs I don't see traces of beeline [14:23:55] and that explains the errors, namely failed auth to the Metastore [14:24:19] or maybe hive 2.x changed something in handling jdbc security [14:26:21] heya teammmm [14:32:15] HEELLLOOOO [14:56:11] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10serviceops, 10Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [15:04:47] (03CR) 10Joal: [C: 04-1] "A bunch of comments on datasets, one probable bug and some minor consistency stuff." (0329 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) (owner: 10Nuria) [15:05:26] ottomata, milimetric : would you have a minute to talk about MDP (modern Data Platform?) [15:05:50] joal: yes, give me like 5 minutes though [15:05:55] sure [15:06:09] hm, would really like to do a deploy before standup, lemme try to get that done first, is that ok? [15:06:16] up [15:09:06] (03CR) 10Joal: Stop using the jar file in the WikidataArticlePlaceholderMetrics (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572734 (https://phabricator.wikimedia.org/T236895) (owner: 10Ladsgroup) [15:19:37] joal: omw cave [15:22:45] joal: can i share the new wikidata hive tables with my team or is there still testing etc. going on? i'm very excited about them... 
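[editor's note] Pulling together joal's earlier pointers (the spark2-shell invocation, spark.sql() for querying, caching the small side, and TSV output), a minimal sketch of the triple join awight describes could look like the following. The EventLogging table name (event.editconflict_test) and its two id fields are placeholders for the real schema, not the actual table; also note Spark has no "tsv" writer short name, so tab-separated output goes through the csv writer with a tab separator.

```scala
// Minimal sketch of joal's plan, pasteable into a spark2-shell started with the options
// he gives above. Table and id-field names for the EventLogging side are placeholders.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// 1. The tiny EventLogging dataset: a few thousand rows, so cache it.
val conflicts = spark.sql("""
  SELECT wiki, event.base_rev_id AS base_id, event.latest_rev_id AS latest_id
  FROM event.editconflict_test
  WHERE year = 2020 AND month = 2
""").cache()

// 2. Revision metadata, trimmed to a workable time range.
val revs = spark.sql("""
  SELECT `database` AS wiki_db, rev_id, rev_parent_id,
         performer.user_text AS author, comment
  FROM event.mediawiki_revision_create
  WHERE year IN (2019, 2020)
""")

// Join revision metadata onto one id column, prefixing the joined columns to avoid clashes.
def withRevMeta(df: DataFrame, idCol: String, prefix: String): DataFrame = {
  val renamed = revs.toDF(revs.columns.map(c => s"${prefix}_$c"): _*)
  df.join(renamed, df(idCol) === renamed(s"${prefix}_rev_id"), "left")
}

// 3. Three joins: the two revisions named in the event, plus the parent of the base one.
val joined = withRevMeta(
  withRevMeta(withRevMeta(conflicts, "base_id", "base"), "latest_id", "latest")
    .withColumn("parent_id", col("base_rev_parent_id")),
  "parent_id", "parent"
).cache()

// 4. TSV output: Spark has no "tsv" format, so write csv with a tab separator.
joined.write.option("sep", "\t").option("header", "true")
  .csv("/user/awight/conflict_revisions_tsv")
```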
[15:25:36] joal: milimetric deployment 1/N done, i can tlak now if you all are there [15:25:55] I'm there but Jo seems afk for a sec [15:25:58] !log setting new user-slice global memory/cpu settings on stat1007 [15:25:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:26:05] k will wait for joal [15:30:50] joal: joseph I'm sure you can do better than this [15:31:10] !log setting new user.slice global memory/cpu settings on notebook1003 [15:31:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:31:15] https://usercontent.irccloud-cdn.com/file/ELE0xGDW/Screen%20Shot%202020-03-02%20at%204.30.34%20PM.png [15:31:21] (the last one) [15:32:26] joal / ottomata: I shared my top secret rough draft with you, I have a lot of thoughts since last we talked, but they're mostly about lame non-tech things. [15:32:51] Wow sorry miswed pings - joining [15:33:26] ottomata: k, both here [15:38:15] !log apply new settings to all stat/notebooks [15:38:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:01:53] ping nuria :) [16:01:53] nuria: standuuup [16:15:46] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10Ottomata) > Reduce the number of POSIX groups to: analtyics, analytics-wmde-users and analytics-privatedata Do you mean `analytics-users` and `analyti... [16:18:49] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10serviceops, 10Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [16:19:00] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) >>! In T246578#5933122, @Ottomata wrote: >> Reduce the number of POSIX groups to: analtyics, analytics-wmde-users and analytics-privatedata >... [16:19:26] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) [16:27:18] isaacj: which one are you talking about? 
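[editor's note] For context on the user.slice changes !logged above: elukey's earlier point to moritzm was that per-user slice limits don't help when many users each stay under their own cap but collectively exhaust the host, so the cap has to sit on user.slice itself (the parent of every user session). A sketch of what such a drop-in can look like; the real change was made via puppet, and the path and numbers here are illustrative, not the values that were applied:

```
# Hypothetical drop-in, e.g. /etc/systemd/system/user.slice.d/resource-limits.conf
[Slice]
# Combined cap for all interactive user sessions, leaving headroom for system daemons.
# MemoryHigh/MemoryMax take effect on cgroup v2 hosts; MemoryLimit= is the cgroup v1 equivalent.
MemoryHigh=80%
MemoryMax=90%
# CPU can be capped the same way, e.g. with CPUQuota= or CPUWeight=.
```

After a `systemctl daemon-reload`, `systemctl show user.slice -p MemoryMax` confirms the value took effect.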
[16:29:15] isaacj: the main one (wmf.wikidata_entity) is ready, the item_page_link is not productionized yet (CR to be finalized and merged, almost there :) [16:42:20] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10Milimetric) p:05Triage→03High [16:46:06] 10Analytics: [data quality alarms] try hourly granularity for traffic entropy metrics - https://phabricator.wikimedia.org/T246680 (10mforns) [16:47:02] 10Analytics: [data quality alarms] try hourly granularity for traffic entropy metrics - https://phabricator.wikimedia.org/T246680 (10mforns) [16:47:04] 10Analytics, 10Analytics-Kanban: Data Quality Alarms - https://phabricator.wikimedia.org/T198986 (10mforns) [16:48:18] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10DiscussionTools, and 4 others: New EventLogging queue doesn't log events in window.unload - https://phabricator.wikimedia.org/T246382 (10Milimetric) p:05Triage→03High a:05DLynch→03Milimetric [16:48:54] 10Analytics, 10Analytics-Kanban: Virtual pageviews should set access_type to mobile if webhost is a mobile one - https://phabricator.wikimedia.org/T246309 (10Milimetric) a:03fdans [16:49:15] 10Analytics, 10Analytics-Kanban: Virtual pageviews should set access_type to mobile if webhost is a mobile one - https://phabricator.wikimedia.org/T246309 (10Milimetric) p:05Triage→03High [16:51:15] 10Analytics: [data quality alarms] Reduce the K to generate more reports - https://phabricator.wikimedia.org/T246682 (10mforns) [16:51:17] 10Analytics: Refine should DROP IF EXISTS before ADD PARTITION - https://phabricator.wikimedia.org/T246235 (10Milimetric) p:05Triage→03Medium [16:51:41] 10Analytics: [data quality alarms] Reduce the K to generate more reports - https://phabricator.wikimedia.org/T246682 (10mforns) [16:51:43] 10Analytics, 10Analytics-Kanban: Data Quality Alarms - https://phabricator.wikimedia.org/T198986 (10mforns) [16:52:54] 10Analytics: Should reportupdater Pingback reports be refactored? 
- https://phabricator.wikimedia.org/T246154 (10Milimetric) p:05Triage→03Medium [16:53:03] 10Analytics: [data quality alarms] add traffic metrics to test whether they help - https://phabricator.wikimedia.org/T246683 (10mforns) [16:53:15] 10Analytics: [data quality alarms] add traffic metrics to test whether they help - https://phabricator.wikimedia.org/T246683 (10mforns) [16:53:17] 10Analytics, 10Analytics-Kanban: Data Quality Alarms - https://phabricator.wikimedia.org/T198986 (10mforns) [16:55:36] 10Analytics, 10Analytics-Kanban, 10Epic, 10Product-Analytics (Kanban): Spark sessions can provision kerberos tickets in a more predictable manner - https://phabricator.wikimedia.org/T246132 (10Milimetric) p:05Triage→03High a:05Ottomata→03elukey [16:56:11] 10Analytics: Check home/HDFS leftovers of flemmerich - https://phabricator.wikimedia.org/T246070 (10Milimetric) p:05Triage→03High [16:57:46] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Desktop Improvements, and 6 others: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 (10Milimetric) p:05Triage→03High [17:02:33] 10Analytics: [data quality alarms] add traffic metrics to test whether they help - https://phabricator.wikimedia.org/T246683 (10Milimetric) p:05Triage→03Medium [17:02:42] 10Analytics: [data quality alarms] Reduce the K to generate more reports - https://phabricator.wikimedia.org/T246682 (10Milimetric) p:05Triage→03High [17:02:52] 10Analytics: [data quality alarms] try hourly granularity for traffic entropy metrics - https://phabricator.wikimedia.org/T246680 (10Milimetric) p:05Triage→03Medium [17:04:01] joal: yeah, both wikidata_entity and item_page_link. sounds good then -- I noticed that item_page_link didn't have data yet but excited for when it's available! [17:05:57] 10Analytics, 10Operations, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Milimetric) p:05Triage→03Medium We should have a meeting about this towards the end of this quarter / beginning of next.... [17:07:26] 10Analytics, 10Product-Analytics: Spark application UI shows data for different application - https://phabricator.wikimedia.org/T245892 (10Milimetric) 05Open→03Declined We tried multiple time to reproduce, and we couldn't but we still think it happened because we trust Neil. As we informally chatted in th... [17:07:28] 10Analytics, 10Epic, 10Product-Analytics (Kanban): Analysts cannot reliably use wmfdata to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (10Milimetric) [17:13:00] 10Analytics, 10Product-Analytics (Kanban): Spark applications crash when running large queries - https://phabricator.wikimedia.org/T245896 (10Milimetric) 05Open→03Declined True, but hopefully we can mitigate this with knowledge about Spark, settings bundles, etc. There is no silver bullet that'll make Spa... 
[17:13:02] 10Analytics, 10Epic, 10Product-Analytics (Kanban): Analysts cannot reliably use wmfdata to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (10Milimetric) [17:13:39] 10Analytics, 10Epic, 10Product-Analytics (Kanban): Analysts cannot reliably use wmfdata to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (10Milimetric) monitoring this for any additional subtasks [17:16:13] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-extensions-CollaborationKit, 10Multimedia: Decide on JSON validation library - https://phabricator.wikimedia.org/T147137 (10Milimetric) The server-side EventLogging validation will be deprecated in favor of ingestion through EventGate, which has its own va... [17:38:37] 10Analytics, 10Product-Analytics (Kanban): wmfdata cannot recover from a crashed Spark session - https://phabricator.wikimedia.org/T245713 (10kzimmerman) [17:47:04] 10Analytics, 10Epic, 10Product-Analytics (Kanban): Analysts cannot reliably use wmfdata to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (10kzimmerman) a:05kzimmerman→03nshahquinn-wmf Thanks @Milimetric! Reassigning to @nshahquinn-wmf, who's continuing work on this.... [18:02:27] gone for finer, back after [18:05:28] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata) [18:07:43] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576099/ /o\ [18:10:44] also I am now seeing some parsing errors in hive 2 [18:10:48] of our scripts [18:10:56] for the moment something mild, but interesting [18:11:01] I'll get all info somewhere in the task [18:13:29] so it means that moving to hive2 may break users [18:16:14] hmm elukey maybe we can at least make sure all the refinery stuff is good? [18:16:16] and RU? [18:18:54] ottomata: oh yes sure, I'll make sure that all works [18:19:02] it was a general thought [18:19:25] I was trying to check an upgraade script doc from upstream but didn't find much [18:21:25] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10nshahquinn-wmf) >>! In T246578#5932606, @elukey wrote: > There are other use cases for people using the stat boxes, that often don't involve private da... [18:32:23] * elukey afk for a bit! [19:00:20] nuria: is now a good time to sync? [19:05:43] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10serviceops, 10Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [19:09:04] joal: i have meetings until late today, let me push another patch and we can sync up tomorrow? [19:09:15] sure nuria - no problem [19:09:24] good luck with meetings nuria :S [19:09:54] joal: sorry about that. Will do a formal pass on via e-mail later [19:10:15] no worries at all - I have stuff to do :) [19:13:13] joal: found the issue with oozie/hive, it was the DBTokenStore setting :( [19:13:31] I have seen the patch, and wondered about that [19:13:35] elukey: --^ [19:13:43] so how come ???? 
This is very weird [19:13:59] it might be a bug with hive 2.x, not sure [19:14:12] :S [19:14:22] the setting, IIUC, is meant for hive metastores in HA, so not really our use case [19:14:29] plus there was that weird zookeeper setting [19:14:46] tomorrow I'll restart the hive daemons on an-coord1001, let's see if the kerberos sporadic issues come back [19:15:45] now I am seeing issues with the hive sql parser, but format changes are expected in major version changes [19:17:30] yessir - please let me help when you want! [19:18:26] extract_data_loss.hql seems the script most affected for now [19:18:53] I had to add `` around AS statements (select bla AS `bla2` for example) [19:18:56] and now I see [19:18:58] org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 24:6 cannot recognize input near 'CONCAT' '(' 'CAST' in expression specification [19:19:49] the concat+cast etc.. is elaborated so if something changed in the language parse it makes sense that fails :) [19:21:06] reading https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF [19:21:21] A || B seems to be a shorthand for concat(A,B) [19:21:23] really nice [19:37:26] anyway, going to start again tomorrow :) [19:37:39] * elukey afk! [19:37:43] bye elukey [19:38:56] 10Analytics, 10Epic, 10Product-Analytics (Kanban): Analysts cannot reliably use wmfdata to run SQL queries against Hive databases - https://phabricator.wikimedia.org/T245891 (10Nuria) It will be worth thinking whether the goal of having 1 library (wmfdata) that is a fit for all use cases is an suitable one t... [19:45:55] ottomata: heya - question for yyou regarding refine [19:46:27] ya? [19:46:45] mediawiki-events (revision-create etc) are quite late (last hour = 10) [19:46:48] is that expected? [19:46:55] hm, no [19:47:06] not expecteed [19:47:10] that's 9 hours late? [19:47:15] you mean in camus or refined? [19:47:18] via camus? [19:47:19] raw? [19:47:22] seems to have happened after the error we received this morning [19:47:26] refined [19:47:35] ok so in raw is ok [19:48:23] what do most recent refine job logs say? [19:48:28] about success vs. failure? [19:48:33] and...is it all events? [19:48:41] looking for logs [19:48:46] I checked some of them, not all [19:50:19] ottomata: error in mediawiki-refine log [19:50:41] seems to be the same at every hour [19:52:29] ok so the job is actuallyl faillilng? [19:52:33] yessir [19:52:36] oh yeah big time [19:52:40] not even on a dataset [19:52:41] since the first email we received [19:52:42] just the whole job [19:52:45] yup [19:52:55] at org.wikimedia.analytics.refinery.job.refine.RefineTarget.readMTimeFromFile(RefineTarget.scala:168) [19:53:02] looking at logs - first time I see that [19:53:05] hm, elukey said he did something with the success files ya? [19:53:25] elukey said he removed flags yes [19:53:41] probably shoudl guard against this... [19:53:48] but if thhe file is empty i think this might happen [19:53:54] ? [19:53:57] there is no refined at timestamp in the _REFINED file [19:54:03] i think anywa [19:54:05] looking [19:54:25] oh - so there is an empty refined flag? weird [19:54:31] am guessing [19:54:35] right [19:55:27] it doesn't tell us which one!!! 
:) [19:55:36] trying to get it as well [19:56:00] it actually could be any of the _REFINED files in the last --since hours [19:56:04] /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2020/month=3/day=2/hour=10/_REFINED [19:56:11] yah there is that one [19:56:11] MWARF [19:56:25] That's all for today [19:56:37] hdfs dfs -du /wmf/data/event/*/datacenter=eqiad/year=2020/month=3/day=2/*/_REFINED [19:57:05] ottomata: removing that file should fix, shouldn't it? [19:57:15] it should [19:57:23] ottomata: doing so and waiting for next run [19:57:27] joal [19:57:28] wait [19:57:29] 0 0 /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2020/month=3/day=2/hour=10/_REFINED [19:57:32] Thanks a lot for the help in debugging [19:57:38] ? [19:57:47] /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2020/month=3/day=2/hour=10/_REFINED has 0 size [19:57:57] try just removing that one [19:58:09] that was what I was planning to do :) [19:58:39] !log Remove faulty _REFINED file at /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2020/month=3/day=2/hour=10/_REFINED [19:58:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:59:25] OH that file [19:59:29] sorry thoughth you were going to remove all of them [19:59:34] yes proceed joal [19:59:42] Oh no! Not going to reprocess everything! [19:59:47] ok gerat! proceed! [20:00:10] And I guess ottomata since the system crash at an unexpected place it doesn't proceed with other folders [20:00:17] Will create a ticket [20:00:30] yeah [20:00:32] for sure [20:00:38] its crashing before even launching any spark stuff [20:00:55] thanks joal [20:02:24] 10Analytics: Make spark-refine resilient to incorrectly formatted _REFINED files - https://phabricator.wikimedia.org/T246706 (10JAllemandou) [20:02:28] ottomata: --^ [20:02:32] ty [20:03:45] ottomata: shall I manually restart a [20:03:52] an exec, ot wait for the next? [20:04:39] joal: either way! [20:04:59] if you wait, i'd tial the log file so you know when it is done and which app log to check [20:04:59] ok :) [20:05:16] ok let's do that [20:38:48] ottomata: I can tell from the time it takes that some stuff is being done (refine) [21:04:07] (03CR) 10Fdans: [C: 03+2] Localisation updates from https://translatewiki.net. [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576042 (owner: 10L10n-bot) [21:12:49] 10Analytics, 10Operations, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10LilyOfTheWest) @Milimetric that is a good point. @Miriam I suggest replacing "highly anonymized" in the task description w... [21:16:15] 10Analytics, 10Operations, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Milimetric) The internal use cases would be nice to support, and I think we can discuss that separately from how much we tru... [21:19:45] 10Analytics, 10Operations, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) @Milimetric that is a good point. @Miriam I suggest replacing "highly anonymized" in the task description with "suf... 
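[editor's note] For the record, the Refine crash above (now T246706) happened before any Spark work was launched: RefineTarget.readMTimeFromFile choked on a zero-byte _REFINED flag left over from the earlier flag cleanup, and the whole job aborted. A rough sketch of the kind of guard that task asks for; this is not the actual refinery code, and it assumes the flag normally holds a single epoch-timestamp line:

```scala
// Sketch only: tolerate an empty or malformed _REFINED flag instead of failing the job.
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source
import scala.util.Try

def readRefinedMTime(fs: FileSystem, flag: Path): Option[Long] = {
  if (!fs.exists(flag)) {
    None                                         // never refined
  } else {
    val status = fs.getFileStatus(flag)
    val written: Option[Long] =
      if (status.getLen == 0) None               // the zero-byte case that bit us here
      else Try {
        val in = fs.open(flag)
        try Source.fromInputStream(in).getLines().next().trim.toLong
        finally in.close()
      }.toOption
    // Fall back to the flag's own HDFS modification time if its contents are unusable.
    written.orElse(Some(status.getModificationTime))
  }
}
```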
[21:24:02] joal: :) [21:34:51] 10Analytics, 10Operations, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Nuria) What is the number of users this potential system would serve? 10/100? [21:41:10] 10Analytics, 10Operations, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) @Nuria can you help me understand in what sense the answer to this question is important? Is it about RAM and Storage... [21:48:57] Hey folks. I'm looking at MediaWiki history. Does this have historical information about page protection in it anywhere? If not, I'd like to file a task and would welcome suggestions for how to tag it :) [21:49:54] 10Analytics, 10Operations, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Nuria) This ask, in terms of infrastructure is a significant one and we would like to e how many users are benefiting from i... [21:52:14] halfak: give me an example of what page protections are, like "pages in which editing is restricted for vandalism reasons"? [21:55:08] Right. Page protections are time bounded event. A page can have different levels of protection. [21:55:27] Protection can appear and disappear independently of a revision. [21:56:00] This paper provides a great description of them and how to work them out from public data: https://opensym.org/os2015/proceedings-files/p403-hill.pdf [22:11:14] halfak: i see, ok, ya, i do not think that is in there . The dataset has gotten quite complicated to produce and we do not think we can add any more dimensions to it soon, but do file a ticket. There are couple other requests for more dimensions and we are thinking of what other ways we could satisfy those [22:12:01] Cool. Will file. I certainly understand that this is pretty complicated. [22:12:18] Any tips on tagging? Maybe you could link me to a ticket you like and I can make it look like that one. [22:19:53] 10Analytics, 10Operations, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) >>! In T245833#5934960, @Nuria wrote: > This ask, in terms of infrastructure is a significant one and we would like t... [22:21:34] (03CR) 10Nuria: Stop using the jar file in the WikidataArticlePlaceholderMetrics (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572734 (https://phabricator.wikimedia.org/T236895) (owner: 10Ladsgroup) [22:32:25] I wonder if I should make a sub-task of https://phabricator.wikimedia.org/T221828 [22:38:53] 10Analytics: Add historical page protection status to MediaWiki history - https://phabricator.wikimedia.org/T246723 (10Halfak) [22:39:15] OK I hope that is useful. I'm off for the day. Take care, A-team :) [22:39:37] 10Analytics, 10Analytics-Cluster: Hue doesn't show executor details - https://phabricator.wikimedia.org/T246724 (10awight) [23:44:14] 10Analytics, 10Analytics-Cluster: Hue doesn't show executor details - https://phabricator.wikimedia.org/T246724 (10Nuria) You are trying to run an oozie coordinator/workflow? cause you can get logs by doing yarn logs -applicationId
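[editor's note] On halfak's question: until something like T246723 lands, historical protection state can be approximated from the protection log, which is what the linked Hill & Shaw paper does with public data. A hedged sketch in the same spark2-shell style as earlier; the table name, partition columns, and the idea that log_params carries the protection levels are assumptions about the sqooped logging table, and page moves or old rows with a missing log_page need the extra care the paper describes:

```scala
// Rough sketch: derive protection spans from the protection log. Assumes the sqooped
// logging table (wmf_raw.mediawiki_logging) with stock MediaWiki column names; the
// snapshot and wiki_db values are examples, and parsing log_params is omitted.
spark.sql("""
  WITH protect_events AS (
    SELECT log_page, log_title, log_namespace, log_timestamp, log_action, log_params
    FROM wmf_raw.mediawiki_logging
    WHERE snapshot = '2020-02'
      AND wiki_db = 'enwiki'
      AND log_type = 'protect'
      AND log_page IS NOT NULL
  )
  SELECT log_page,
         log_title,
         log_action,
         log_params,
         log_timestamp AS span_start,
         -- a protection state lasts until the next protect/modify/unprotect on the page
         LEAD(log_timestamp) OVER (PARTITION BY log_page ORDER BY log_timestamp) AS span_end
  FROM protect_events
  ORDER BY log_page, log_timestamp
""").show(20, false)
```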