[06:02:13] 10Analytics, 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10MoritzMuehlenhoff) 05Open→03Stalled [07:55:14] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Run a script to check REFINE_FAILED flags daily - https://phabricator.wikimedia.org/T240230 (10elukey) Just tested the deployed refinery jars: ` elukey@stat1004:~$ spark2-submit --class org.wikimedia.analytics.refinery.job.refine.RefineFailuresChecker /srv/dep... [08:24:35] 10Analytics, 10Analytics-Kanban: Move systemd timer from an-coord1001 to an-launcher1001 - https://phabricator.wikimedia.org/T249593 (10elukey) p:05Triage→03High [09:17:54] !log enable refine for TwoColConflictExit (EL schema) [09:17:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:10:13] * elukey early lunch! [10:58:01] * elukey interview [11:14:00] groceryheist: Hi - I'd like to have a talk with you about default resource settings for spark jobs - From my perspective you use `large` settings as default, which is probably not needed [11:14:59] (03PS1) 10WMDE-Fisch: Only track unique users disabling TwoColConflict [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) [12:05:26] from superset upstream [12:05:42] - When rendering a TableViz with the legacy Druid connector, a cryptic [12:05:45] error message is raised if the query doesn't return any data. A PR #9480 to [12:05:48] address this is pending final review and merging. As this is affecting a [12:05:51] deprecated feature in Superset, this was not regarded as a blocker for this [12:05:54] release. [12:06:16] I am not aware of another way to use druid but we are probably not using the right one [12:08:13] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "One minor optimization might be possible." (032 comments) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [12:24:57] joal: wow I made an interesting discovery [12:25:02] ? [12:25:11] Druid can be queried by Superset using sqlalchemy [12:25:36] NICE [12:25:39] I think that this is the preferred way for them [12:25:57] them being? [12:26:21] upstream [12:26:24] Ah [12:26:28] hm [12:27:41] also, using sqlalchemy the SQL Lab is available [12:27:46] for druid as well [12:30:34] (03PS2) 10WMDE-Fisch: Only track unique users disabling TwoColConflict [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) [12:31:18] (03CR) 10WMDE-Fisch: Only track unique users disabling TwoColConflict (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [12:36:23] just asked upstream to clarify, but I think it is sqlalchemy [12:36:30] I am going to add it to superset [12:37:20] This is great :) [12:38:40] elukey: Thanks again for the help: I see data coming through for TwoColConflictExit :-) [12:38:51] awight: thank you for fixing! [12:39:44] FYI, I rewrote the evil, nested field as an optional packed string and will deploy that without changing the event schema again. My plan is to post-process the string using Java/Spark. [12:40:21] hellooo team :] [12:41:07] joal: very weird, sqlalchemy works for analytics but not for public [12:41:21] :S [12:41:38] elukey: I assume it could be a setting about enabling SQL mode in druid? 
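For reference, the sqlalchemy route discussed above goes through Druid's SQL endpoint. A minimal sketch, assuming the pydruid package is installed (it provides the `druid://` SQLAlchemy dialect); the broker hostname and datasource below are placeholders, not the real analytics cluster:

    from sqlalchemy import create_engine, text

    # pydruid registers the "druid://" dialect; /druid/v2/sql/ is Druid's SQL endpoint.
    # Broker host/port and the datasource name are illustrative only.
    engine = create_engine("druid://druid-broker.example.org:8082/druid/v2/sql/")

    with engine.connect() as conn:
        for row in conn.execute(text("SELECT __time, uri_host FROM webrequest LIMIT 5")):
            print(row)

In Superset, that same URI string is what goes into the database connection's SQLAlchemy URI field, which is also what makes SQL Lab usable against Druid, as noted below.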
[12:42:10] joal: ah maybe we enabled it only for analytics [12:42:21] elukey: possibly - can't recall [12:42:37] druid.sql.enable: true [12:42:39] yep :) [12:43:01] joal: heya :] yesterday I wrote and tested https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/586432/ see: https://tinyurl.com/s6ydxmo [12:43:32] Will review mforns :) [12:43:40] Thanks a lot mforns! [12:43:41] :] thank you [12:44:16] joal: if you want to check sql lab now there is a database called Druid Analytics SQL [12:44:50] hey elukey :] wanna try to airflow a stats machine? yesterday I tried to install airflow inside a python venv but failed :[ [12:45:51] mforns: ok in ~15 mins? [12:45:52] ?? [12:46:04] of course! :] [12:47:47] elukey: works like a charm :) [12:48:37] joal: enabling sql for public too, ok? [12:49:12] elukey: I'm afraid of queries taking the thing down and preventing AQS from answering [12:49:49] might be a good point yes [13:02:45] 10Analytics, 10Analytics-Kanban: Make spark-refine resilient to incorrectly formatted _REFINED files - https://phabricator.wikimedia.org/T246706 (10mforns) [13:12:28] mforns: sorry gimme 5 :) [13:12:57] no problemo elukey, take all the time [13:18:00] mforns: all right, all yours [13:18:11] :D [13:18:19] bc? or from here? [13:18:27] we can start in bc [13:18:32] ok [13:30:59] (03CR) 10Awight: "Can we leverage the database rather than doing this in PHP memory?" (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [13:41:51] ottomata: one question - I am checking RefineFailuresChecker's code and I always assumed to run it via deploy-mode client.. if I run it via yarn, is it going to alarm if a refine failure flag is found with the current settings? [13:42:56] Hmmm ah it won't, no. [13:43:12] hm, yeah i think that's why i do emails for the refinemonitor [13:44:17] hm, i wonder if we could make a wrapper that would check the log output after the job is complete and then exit appropriately? [13:44:36] we could make the spark job.sh wrapper do that somehow for all jobs if we could think of a smart way to do it [13:44:44] hm. [13:45:03] we might need to make our spark jobs do something to indicate global success or failure [13:45:15] like writing a job failure/success flag, or emitting an event [13:47:54] ottomata: I am wondering if just raising an exception in scala works, it would cause the yarn job to fail and then I assume we'd alarm from the timer [13:50:06] i don't think the failure makes it back to the launcher process in that case [13:50:17] you can check, but i don't think it does [13:50:28] you could kill the launcher process and the job will still be running in yarn [13:50:30] ah ok, so the launcher would exit zero [13:50:33] yeah [13:50:39] :( [13:50:57] in this particular case I think we could try deploy-mode client, should be lightweight [13:51:13] yeah [13:51:14] give it a try [13:51:16] super [13:51:22] i think it will mostly be fine [13:51:38] it just breaks the rule we have of 'launcher jobs don't do much work, so ok let's use ganeti' :p [13:51:49] yes yes :( [13:52:17] hm [13:52:25] i think if we made the job do some explicit icinga stuff [13:52:26] this could work [13:52:29] not sure [13:52:33] would nrpe help us here? [13:52:39] if the spark job itself did some nrpe stuff? [13:52:39] hm [13:52:45] dunno, that might be per host [14:10:38] a-team, today's standup is later than all other days, is that expected? 
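The wrapper idea floated above — check for failures after the job completes and exit appropriately — could look roughly like the following. A sketch only: the jar path and flag location are illustrative, and it relies on the fact that `hdfs dfs -ls` exits non-zero when a glob matches nothing:

    #!/usr/bin/env python3
    # Sketch: launch a Spark job in yarn cluster mode, then fail the launcher
    # process if any _REFINE_FAILED flags exist, so the systemd timer (and
    # its Icinga check) can alarm even though the driver ran inside YARN.
    import subprocess
    import sys

    spark_cmd = [
        "spark2-submit", "--master", "yarn", "--deploy-mode", "cluster",
        "--class", "org.wikimedia.analytics.refinery.job.refine.RefineFailuresChecker",
        "/srv/deployment/analytics/refinery/artifacts/refinery-job.jar",  # illustrative path
    ]
    job = subprocess.run(spark_cmd)

    # Illustrative flag location; -ls exits 0 only if the glob matches something.
    flag_glob = "/wmf/data/event/*/year=*/month=*/day=*/hour=*/_REFINE_FAILED"
    flags = subprocess.run(["hdfs", "dfs", "-ls", flag_glob],
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    sys.exit(1 if (job.returncode != 0 or flags.returncode == 0) else 0)

In the end the conversation settles on deploy-mode client for this particular job, which sidesteps the problem: the driver, and any exception it raises, runs in the launcher process itself, so the exit code propagates for free.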
[14:10:53] mforns: it is, nuria has a manager meeting I think [14:11:09] oh ok thx! [14:16:16] oh elukey it is actually more complicated than that [14:16:25] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine#Running_Refine_in_local_or_yarn_client_mode [14:16:47] i don't know exactly why [14:16:48] but specifically [14:17:12] you don't need to include the jars in extraClassPath via --files (or --jars, don't remember which refine_job.pp does) [14:18:29] iirc it won't work if you do [14:18:51] (03PS1) 10Mforns: Add check for corrupted (empty) flag files [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/587270 (https://phabricator.wikimedia.org/T246706) [14:19:07] ottomata: sorry I am not following :D [14:19:48] hmm, maybe it isn't relevant for your job [14:19:49] ok [14:19:53] is it about the code review that I just sent or the alarming? [14:19:58] code review [14:19:58] sorry [14:20:01] ahh [14:20:01] about yarn client mode [14:20:31] so I tested it manually on launcher1001 and it works, the file is picked up [14:20:43] ok, i think it might be working because your job doesn't have to interact with hive [14:20:45] directly [14:20:50] nm proceed! [14:21:29] ack thanks for the check :) [14:21:47] how can you parse a code review in 5 seconds after I send one? [14:22:07] :D [14:22:51] haha [14:24:58] (03CR) 10Mforns: [C: 04-2] "Still testing this with real data." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/587270 (https://phabricator.wikimedia.org/T246706) (owner: 10Mforns) [14:38:47] taking a little break, brb [14:42:40] (03PS2) 10Ottomata: Unify Refine transform functions to work with both legacy and new event data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/586447 (https://phabricator.wikimedia.org/T238230) [15:25:13] a-team o/ [15:25:48] do you know why UAParser.java doesn't set browser_minor? [15:26:03] in the EventLogging parser, we set that and a few extra fields like is_bot and is_mediawiki [15:26:15] should we adapt UAParser to set these like EL does? [15:29:56] webrequest uses agent_type [15:30:03] perhaps we should use that instead? [15:30:06] ottomata: maybe we do not need it [15:30:08] aye yai ai [15:30:14] instead of is_bot [15:30:21] i guess i have to be backwards compat here... [15:30:27] ottomata: it does not seem like we would [15:31:04] ottomata: on meeting but can talk about this later [15:31:10] k [15:32:50] 10Analytics, 10Better Use Of Data, 10Wikimedia-Logstash, 10Documentation, and 3 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (10jlinehan) [15:33:46] 10Analytics, 10Better Use Of Data, 10Product-Analytics, 10Product-Infrastructure-Team-Backlog, 10Epic: Session Length Metric. Web implementation - https://phabricator.wikimedia.org/T248987 (10jlinehan) [15:34:04] 10Analytics, 10Better Use Of Data, 10Product-Analytics, 10Epic, 10Product-Infrastructure-Team-Backlog (Kanban): Session Length Metric. Web implementation - https://phabricator.wikimedia.org/T248987 (10jlinehan) [15:54:23] joal: got 5 mins before standup for a spark tip for me? :D [16:01:21] a-team: sorry can’t make standup today, status is the same, working on the rfc [16:01:27] ping ottomata milimetric [16:13:57] Aouch - Internet is really bad at home :( [16:44:42] I didn't get a post-standup on user agent! [16:44:43] ah! [16:44:54] nuria: ? 
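Context for the user-agent thread above: the EventLogging parser adds derived booleans like is_bot on top of the parsed UA fields, and the Refine work below converts the parsed-UA map into a struct. A PySpark sketch of the general shape — the real refinery code is Scala, and the column names and the is_bot heuristic here are illustrative, not the production logic:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # A parsed-UA map of the kind UAParser produces; values are made up.
    df = spark.createDataFrame(
        [({"browser_family": "Firefox", "browser_major": "75",
           "device_family": "Other"},)],
        ["user_agent_map"],
    )

    fields = ["browser_family", "browser_major", "device_family"]
    df = df.withColumn(
        "user_agent",
        F.struct(
            # promote each map entry to a named, queryable struct field
            *[F.col("user_agent_map").getItem(f).alias(f) for f in fields],
            # illustrative is_bot heuristic, not what production uses
            (F.col("user_agent_map").getItem("device_family") == "Spider").alias("is_bot"),
        ),
    )

The appeal of the struct over the raw map is that each field becomes a typed, named column that downstream Hive/Spark queries can select directly.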
[16:48:32] ottomata: if ok I'll go and have dinner with the kids and will help after [16:49:54] k [16:52:06] joal: i actually think i have something working, now just need to know what the right thing to do is! [16:52:13] nuria: going to make lunch but would love to brain bounce with ya today [17:01:12] 10Analytics, 10Analytics-Wikistats, 10Operations, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10Krinkle) 05Open→03Resolved a:03Krinkle Confirmed via . It now... [17:01:20] 10Analytics, 10Analytics-Wikistats, 10Operations, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10Krinkle) [17:10:42] * elukey off! [17:40:36] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Team-Backlog: Develop test environment solution for MEP analytics events - https://phabricator.wikimedia.org/T238837 (10Ottomata) [17:56:35] ottomata: I hear my help is not needed - correct? [17:57:22] (03PS1) 10Ottomata: Add parse_user_agent transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/587305 (https://phabricator.wikimedia.org/T238230) [17:57:24] i think i found a pretty easy way! [17:57:31] was going to brain bounce converting a map to a struct [17:57:52] but i got it! [17:57:57] line 287 in https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/587305/1/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/TransformFunctions.scala [17:59:23] heya joal or ottomata I'm having kerberos problems when running refine: [17:59:59] I try sudo -u analytics kerberos-run-command analytics spark2-submit, but it fails because kerberos-run-command only supports executables [18:00:09] ? [18:00:11] that's weird [18:00:42] If I try it with my own user, it also fails, because of permission problems, even if I think it writes only to my HDFS home folder [18:01:26] what if you do /usr/bin/spark2-submit [18:01:28] mforns: is spark-submit a personal script overriding the main one? [18:01:28] any difference? [18:01:28] If I try putting the command in an executable script, it also fails: OSError: [Errno 8] Exec format error [18:01:56] no no, I was using /usr/bin/spark2-submit, I just simplified the command here [18:03:21] hm [18:03:24] how do you guys do it? [18:03:37] i thought that way [18:03:40] mforns: I think I have actually not used spark-submit in a long time! [18:03:45] but actually, looking at my history on an-coord1001 [18:03:54] just sudo -u analytics spark2-submit [18:04:03] i think it works because the ticket is cached [18:04:15] this works as long as the ticket has been initialized [18:04:17] yup [18:04:22] (in meeting now) [18:04:28] ottomata: yes, provided that the ticket exists that is fine I think, unless the job runs for a long time [18:05:39] does it need to be executed from an-coord1001? [18:11:43] mforns: there are no more keytabs for the analytics user on stat machines [18:12:02] so sudo-ing as analytics with kerberos should be done from an-coord1001 indeed [18:12:13] joal: I see the existing refine timers use hive_server_url = an-coord1001.eqiad.wmnet:10000 [18:12:23] not sure this requires them to be executed from there? 
[18:12:30] not related I think [18:12:41] mforns: related to using the analytics user keytab :) [18:12:58] yea [18:30:38] joal: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide#Use_Spark_2 [18:30:41] maybe it's this [18:31:06] hm, still failing... :[ [18:31:19] mforns: shouldn't be related - is it an oozie job? [18:31:43] no, just a refine job [18:37:55] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+2] Only track unique users disabling TwoColConflict (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [18:38:19] (03Merged) 10jenkins-bot: Only track unique users disabling TwoColConflict [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [18:58:55] mforns: just curious, why do you need to run it as analytics? [18:59:18] I shouldn't, right? [18:59:30] but if I run it as me, it fails as well with a kerberos error [18:59:49] I'm trying now to execute from an-coord and passing the principal and keytab [19:00:01] the analytics keytab is only on an-coord [19:01:10] joal: when I ran it as mforns, Refine "worked", but I got _REFINE_FAILURES; when looking at the logs, there were kerberos problems [19:01:28] mforns: seems bizarre :( [19:01:50] joal: OK I got it to work [19:01:59] in an-coord with principal and keytab [19:01:59] mforns: Ah! could be related to the fact that refine needs to access hive using JDBC, and therefore needs a credential for the metastore [19:02:20] yes, that's what I was referring to when I pasted the docs [19:02:22] nice mforns [19:02:28] ah ok ok [19:02:40] sorry I didn't get it :( [19:03:22] well, the docs don't say metastore, rather spark-thriftserver [19:03:40] but I saw in the logs that spark was trying to access hive with the hive principal [19:03:48] so I tried to pass that explicitly [19:04:57] good call mforns [19:05:41] :] thanks for the help [19:08:00] Gone for tonight [19:15:36] (03CR) 10Mforns: [V: 03+2] "OK! Not without difficulty I was able to test this works! :] I think it's ready for review." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/587270 (https://phabricator.wikimedia.org/T246706) (owner: 10Mforns) [19:51:48] (03CR) 10Awight: Only track unique users disabling TwoColConflict (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [20:32:21] bye all! [21:42:18] 10Analytics, 10Growth-Team, 10Product-Analytics: Growth: validate that data is purged after 270 days - https://phabricator.wikimedia.org/T249666 (10MMiller_WMF) [21:42:28] 10Analytics, 10Growth-Team, 10Product-Analytics, 10Patch-For-Review: Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (10MMiller_WMF) 05Open→03Resolved Thank you! Now that this is running, I filed {T249666} so that we remember to validate that the purging is happen... 
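For the record, roughly the shape of the invocation that ended up working from an-coord1001 — wrapped in Python only to keep the examples in one language; the principal, keytab path, class, and jar below are placeholders, not the real values:

    import subprocess

    # spark2-submit on YARN accepts --principal/--keytab so the job can
    # (re)authenticate itself, instead of relying on a cached kinit ticket
    # that may expire while the job runs.
    subprocess.run([
        "sudo", "-u", "analytics", "spark2-submit",
        "--master", "yarn",
        "--principal", "analytics/an-coord1001.eqiad.wmnet@WIKIMEDIA",   # placeholder
        "--keytab", "/etc/security/keytabs/analytics/analytics.keytab",  # placeholder
        "--class", "org.wikimedia.analytics.refinery.job.refine.Refine", # illustrative
        "refinery-job.jar",
    ], check=True)

Passing the credentials explicitly also addresses the caveat raised earlier in the day: a cached ticket is fine "provided that the ticket exists... unless the job runs for a long time".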
[22:09:35] PROBLEM - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:03:35] RECOVERY - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers