[06:02:13] 10Analytics, 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10MoritzMuehlenhoff) 05Open→03Stalled [07:55:14] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Run a script to check REFINE_FAILED flags daily - https://phabricator.wikimedia.org/T240230 (10elukey) Just tested the deployed refinery jars: ` elukey@stat1004:~$ spark2-submit --class org.wikimedia.analytics.refinery.job.refine.RefineFailuresChecker /srv/dep... [08:24:35] 10Analytics, 10Analytics-Kanban: Move systemd timer from an-coord1001 to an-launcher1001 - https://phabricator.wikimedia.org/T249593 (10elukey) p:05Triage→03High [09:17:54] !log enable refine for TwoColConflictExit (EL schema) [09:17:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:10:13] * elukey early lunch! [10:58:01] * elukey interview [11:14:00] groceryheist: Hi - I'd like to have a talk with you about default resource settings for spark jobs - From my perspective you use `large` settings as default, which is probably not needed [11:14:59] (03PS1) 10WMDE-Fisch: Only track unique users disabling TwoColConflict [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) [12:05:26] from superset upstream [12:05:42] - When rendering a TableViz with the legacy Druid connector, a cryptic [12:05:45] error message is raised if the query doesn't return any data. A PR #9480 to [12:05:48] address this is pending final review and merging. As this is affecting a [12:05:51] deprecated feature in Superset, this was not regarded as a blocker for this [12:05:54] release. [12:06:16] I am not aware of another way to use druid but we are probably not using the right one [12:08:13] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "One minor optimization might be possible." (032 comments) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [12:24:57] joal: wow I made an interesting discovery [12:25:02] ? [12:25:11] Druid can be queried by Superset using sqlalchemy [12:25:36] NICE [12:25:39] I think that this is the preferred way for them [12:25:57] them being? [12:26:21] upstream [12:26:24] Ah [12:26:28] hm [12:27:41] also, using sqlalchemy the SQL Lab is available [12:27:46] for druid as well [12:30:34] (03PS2) 10WMDE-Fisch: Only track unique users disabling TwoColConflict [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) [12:31:18] (03CR) 10WMDE-Fisch: Only track unique users disabling TwoColConflict (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [12:36:23] just asked upstream to clarify, but I think it is sqlalchemy [12:36:30] I am going to add it to superset [12:37:20] This is great :) [12:38:40] elukey: Thanks again for the help: I see data coming through for TwoColConflictExit :-) [12:38:51] awight: thank you for fixing! [12:39:44] FYI, I rewrote the evil, nested field as an optional packed string and will deploy that without changing the event schema again. My plan is to post-process the string using Java/Spark. [12:40:21] hellooo team :] [12:41:07] joal: very weird, sqlalchemy works for analytics but not for public [12:41:21] :S [12:41:38] elukey: I assume it could be a setting about enabling SQL mode in druid? 
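For reference, the sqlalchemy route discussed above goes through Druid's SQL endpoint. A minimal sketch, assuming the pydruid package is installed (it provides the `druid://` SQLAlchemy dialect); the broker hostname and datasource below are placeholders, not the real analytics cluster:

    from sqlalchemy import create_engine, text

    # pydruid registers the "druid://" dialect; /druid/v2/sql/ is Druid's SQL endpoint.
    # Broker host/port and the datasource name are illustrative only.
    engine = create_engine("druid://druid-broker.example.org:8082/druid/v2/sql/")

    with engine.connect() as conn:
        for row in conn.execute(text("SELECT __time, uri_host FROM webrequest LIMIT 5")):
            print(row)

In Superset, that same URI string is what goes into the database connection's SQLAlchemy URI field, which is also what makes SQL Lab usable against Druid, as noted below.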
[12:42:10] joal: ah maybe we enabled it only for analytics [12:42:21] elukey: possibly - can't recall [12:42:37] druid.sql.enable: true [12:42:39] yep :) [12:43:01] joal: heya :] yesterday I wrote and tested https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/586432/ see: https://tinyurl.com/s6ydxmo [12:43:32] Will review mforns :) [12:43:40] Thanks a lot mforns! [12:43:41] :] thank you [12:44:16] joal: if you want to check sql lab now there is a database called Druid Analytics SQL [12:44:50] hey elukey :] wanna try to airflow a stats machine? yesterday I tried to install airflow inside a python venv but failed :[ [12:45:51] mforns: ok in ~15 mins? [12:45:52] ?? [12:46:04] of course! :] [12:47:47] elukey: works like a charm :) [12:48:37] joal: enabling sql for public too, ok? [12:49:12] elukey: I'm afraid of queries taking the thing down and preventing AQS from answering [12:49:49] might be a good point yes [13:02:45] 10Analytics, 10Analytics-Kanban: Make spark-refine resilient to incorrectly formatted _REFINED files - https://phabricator.wikimedia.org/T246706 (10mforns) [13:12:28] mforns: sorry gimme 5 :) [13:12:57] no problemo elukey, take all the time [13:18:00] mforns: all right, all yours [13:18:11] :D [13:18:19] bc? or from here? [13:18:27] we can start in bc [13:18:32] ok [13:30:59] (03CR) 10Awight: "Can we leverage the database rather than doing this in PHP memory?" (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [13:41:51] ottomata: one question - I am checking RefineFailuresChecker's code and I always assumed to run it via deploy-mode client.. if I run it via yarn, is it going to alarm if a refine failure flag is found with the current settings? [13:42:56] Hmmm ah it won't, no. [13:43:12] hm, yeah i think that's why i do emails for the refinemonitor [13:44:17] hm, i wonder if we could make a wrapper that would check the log output after the job is complete and then exit appropriately? [13:44:36] we could make the spark job.sh wrapper do that somehow for all jobs if we could think of a smart way to do it [13:44:44] hm. [13:45:03] we might need to make our spark jobs do something to indicate global success or failure [13:45:15] like writing a job failure/success flag, or emitting an event [13:47:54] ottomata: I am wondering if just raising an exception in scala works, it would cause the yarn job to fail and then I assume we'd alarm from the timer [13:50:06] i don't think the failure makes it back to the launcher process in that case [13:50:17] you can check, but i don't think it does [13:50:28] you could kill the launcher process and the job will still be running in yarn [13:50:30] ah ok, so the launcher would exit zero [13:50:33] yeah [13:50:39] :( [13:50:57] in this particular case I think we could try deploy-mode client, should be lightweight [13:51:13] yeah [13:51:14] give it a try [13:51:16] super [13:51:22] i think it will mostly be fine [13:51:38] it just breaks the rule we have of 'launcher jobs don't do much work, so ok let's use ganeti' :p [13:51:49] yes yes :( [13:52:17] hm [13:52:25] i think if we made the job do some explicit icinga stuff [13:52:26] this could work [13:52:29] not sure [13:52:33] would nrpe help us here? [13:52:39] if the spark job itself did some nrpe stuff? [13:52:39] hm [13:52:45] dunno, that might be per host [14:10:38] a-team, today's standup is later than all other days, is that expected? 
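The wrapper idea floated above — check for failures after the job completes and exit appropriately — could look roughly like the following. A sketch only: the jar path and flag location are illustrative, and it relies on the fact that `hdfs dfs -ls` exits non-zero when a glob matches nothing:

    #!/usr/bin/env python3
    # Sketch: launch a Spark job in yarn cluster mode, then fail the launcher
    # process if any _REFINE_FAILED flags exist, so the systemd timer (and
    # its Icinga check) can alarm even though the driver ran inside YARN.
    import subprocess
    import sys

    spark_cmd = [
        "spark2-submit", "--master", "yarn", "--deploy-mode", "cluster",
        "--class", "org.wikimedia.analytics.refinery.job.refine.RefineFailuresChecker",
        "/srv/deployment/analytics/refinery/artifacts/refinery-job.jar",  # illustrative path
    ]
    job = subprocess.run(spark_cmd)

    # Illustrative flag location; -ls exits 0 only if the glob matches something.
    flag_glob = "/wmf/data/event/*/year=*/month=*/day=*/hour=*/_REFINE_FAILED"
    flags = subprocess.run(["hdfs", "dfs", "-ls", flag_glob],
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    sys.exit(1 if (job.returncode != 0 or flags.returncode == 0) else 0)

In the end the conversation settles on deploy-mode client for this particular job, which sidesteps the problem: the driver, and any exception it raises, runs in the launcher process itself, so the exit code propagates for free.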
[14:10:53] mforns: it is, nuria has a manager meeting I think [14:11:09] oh ok thx! [14:16:16] oh elukey it is actually more complicated than that [14:16:25] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine#Running_Refine_in_local_or_yarn_client_mode [14:16:47] i don't know exactly why [14:16:48] but specifically [14:17:12] you don't need to include the jars in extraClassPath via --files (or --jars, don't remember which refine_job.pp does) [14:18:29] iirc it won't work if you do [14:18:51] (03PS1) 10Mforns: Add check for corrupted (empty) flag files [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/587270 (https://phabricator.wikimedia.org/T246706) [14:19:07] ottomata: sorry I am not following :D [14:19:48] hmm, maybe it isn't relevant for your job [14:19:49] ok [14:19:53] is it about the code review that I just sent or the alarming? [14:19:58] code review [14:19:58] sorry [14:20:01] ahh [14:20:01] about yarn client mode [14:20:31] so I tested it manually on launcher1001 and it works, the file is picked up [14:20:43] ok, i think it might be working because your job doesn't have to interact with hive [14:20:45] directly [14:20:50] nm proceed! [14:21:29] ack thanks for the check :) [14:21:47] how can you parse a code review in 5 seconds after I send one? [14:22:07] :D [14:22:51] haha [14:24:58] (03CR) 10Mforns: [C: 04-2] "Still testing this with real data." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/587270 (https://phabricator.wikimedia.org/T246706) (owner: 10Mforns) [14:38:47] taking a little break, brb [14:42:40] (03PS2) 10Ottomata: Unify Refine transform functions to work with both legacy and new event data [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/586447 (https://phabricator.wikimedia.org/T238230) [15:25:13] a-team o/ [15:25:48] do you know why UAParser.java doesn't set browser_minor? [15:26:03] in the EventLogging parser, we set that and a few extra fields like is_bot and is_mediawiki [15:26:15] should we adapt UAParser to set these like EL does? [15:29:56] webrequest uses agent_type [15:30:03] perhaps we should use that instead? [15:30:06] ottomata: maybe we do not need it [15:30:08] aye yai ai [15:30:14] instead of is_bot [15:30:21] i guess i have to be backwards compat here... [15:30:27] ottomata: it does not seem like we would [15:31:04] ottomata: on meeting but can talk about this later [15:31:10] k [15:32:50] 10Analytics, 10Better Use Of Data, 10Wikimedia-Logstash, 10Documentation, and 3 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (10jlinehan) [15:33:46] 10Analytics, 10Better Use Of Data, 10Product-Analytics, 10Product-Infrastructure-Team-Backlog, 10Epic: Session Length Metric. Web implementation - https://phabricator.wikimedia.org/T248987 (10jlinehan) [15:34:04] 10Analytics, 10Better Use Of Data, 10Product-Analytics, 10Epic, 10Product-Infrastructure-Team-Backlog (Kanban): Session Length Metric. Web implementation - https://phabricator.wikimedia.org/T248987 (10jlinehan) [15:54:23] joal: got 5 mins before standup for a spark tip for me? :D [16:01:21] a-team: sorry can’t make standup today, status is the same, working on the rfc [16:01:27] ping ottomata milimetric [16:13:57] Aouch - Internet is really bad at home :( [16:44:42] I didn't get a post-standup on user agent! [16:44:43] ah! [16:44:54] nuria: ? 
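Context for the user-agent thread above: the EventLogging parser adds derived booleans like is_bot on top of the parsed UA fields, and the Refine work below converts the parsed-UA map into a struct. A PySpark sketch of the general shape — the real refinery code is Scala, and the column names and the is_bot heuristic here are illustrative, not the production logic:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # A parsed-UA map of the kind UAParser produces; values are made up.
    df = spark.createDataFrame(
        [({"browser_family": "Firefox", "browser_major": "75",
           "device_family": "Other"},)],
        ["user_agent_map"],
    )

    fields = ["browser_family", "browser_major", "device_family"]
    df = df.withColumn(
        "user_agent",
        F.struct(
            # promote each map entry to a named, queryable struct field
            *[F.col("user_agent_map").getItem(f).alias(f) for f in fields],
            # illustrative is_bot heuristic, not what production uses
            (F.col("user_agent_map").getItem("device_family") == "Spider").alias("is_bot"),
        ),
    )

The appeal of the struct over the raw map is that each field becomes a typed, named column that downstream Hive/Spark queries can select directly.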
[16:48:32] ottomata: if ok I'll go and have dinner with the kids and will help after [16:49:54] k [16:52:06] joal: i actually think i have something working, now just need to know what the right thing to do is! [16:52:13] nuria: going to make lunch but would love to brain bounce with ya today [17:01:12] 10Analytics, 10Analytics-Wikistats, 10Operations, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10Krinkle) 05Open→03Resolved a:03Krinkle Confirmed via . It now... [17:01:20] 10Analytics, 10Analytics-Wikistats, 10Operations, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10Krinkle) [17:10:42] * elukey off! [17:40:36] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Team-Backlog: Develop test environment solution for MEP analytics events - https://phabricator.wikimedia.org/T238837 (10Ottomata) [17:56:35] ottomata: I hear my help is not needed - correct? [17:57:22] (03PS1) 10Ottomata: Add parse_user_agent transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/587305 (https://phabricator.wikimedia.org/T238230) [17:57:24] i think i found a pretty easy way! [17:57:31] was going to brain bounce converting a map to a struct [17:57:52] but i got it! [17:57:57] line 287 in https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/587305/1/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/TransformFunctions.scala [17:59:23] heya joal or ottomata I'm having kerberos problems when running refine: [17:59:59] I try sudo -u analytics kerberos-run-command analytics spark2-submit, but it fails because kerberos-run-command only supports executables [18:00:09] ? [18:00:11] that's weird [18:00:42] If I try it with my own user, it also fails, because of permission problems, even if I think it writes only to my HDFS home folder [18:01:26] what if you do /usr/bin/spark2-submit [18:01:28] mforns: is spark-submit a personal script overriding the main one? [18:01:28] any difference? [18:01:28] If I try putting the command in an executable script, it also fails: OSError: [Errno 8] Exec format error [18:01:56] no no, I was using /usr/bin/spark2-submit, I just simplified the command here [18:03:21] hm [18:03:24] how do you guys do it? [18:03:37] i thought that way [18:03:40] mforns: I think I have actually not used spark-submit in a long time! [18:03:45] but actually, looking at my history on an-coord1001 [18:03:54] just sudo -u analytics spark2-submit [18:04:03] i think it works because the ticket is cached [18:04:15] this works as long as the ticket has been initialized [18:04:17] yup [18:04:22] (in meeting now) [18:04:28] ottomata: yes, provided that the ticket exists that is fine I think, unless the job runs for a long time [18:05:39] does it need to be executed from an-coord1001? [18:11:43] mforns: there are no more keytabs for the analytics user on stat machines [18:12:02] so sudo-ing as analytics with kerberos should be done from an-coord1001 indeed [18:12:13] joal: I see the existing refine timers use hive_server_url = an-coord1001.eqiad.wmnet:10000 [18:12:23] not sure this requires them to be executed from there? 
[18:12:30] not related I think [18:12:41] mforns: related to using the analytics user keytab :) [18:12:58] yea [18:30:38] joal: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide#Use_Spark_2 [18:30:41] maybe it's this [18:31:06] hm, still failing... :[ [18:31:19] mforns: shouldn't be related - is it an oozie job? [18:31:43] no, just a refine job [18:37:55] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+2] Only track unique users disabling TwoColConflict (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [18:38:19] (03Merged) 10jenkins-bot: Only track unique users disabling TwoColConflict [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [18:58:55] mforns: just curious, why do you need to run it as analytics? [18:59:18] I shouldn't, right? [18:59:30] but if I run it as me, it fails as well with a kerberos error [18:59:49] I'm trying now to execute from an-coord and passing the principal and keytab [19:00:01] the analytics keytab is only on an-coord [19:01:10] joal: when I ran it as mforns, Refine "worked", but I got _REFINE_FAILURES; when looking at the logs, there were kerberos problems [19:01:28] mforns: seems bizarre :( [19:01:50] joal: OK I got it to work [19:01:59] in an-coord with principal and keytab [19:01:59] mforns: Ah! could be related to the fact that refine needs to access hive using JDBC, and therefore needs a credential for the metastore [19:02:20] yes, that's what I was referring to when I pasted the docs [19:02:22] nice mforns [19:02:28] ah ok ok [19:02:40] sorry I didn't get it :( [19:03:22] well, the docs don't say metastore, rather spark-thriftserver [19:03:40] but I saw in the logs that spark was trying to access hive with the hive principal [19:03:48] so I tried to pass that explicitly [19:04:57] good call mforns [19:05:41] :] thanks for the help [19:08:00] Gone for tonight [19:15:36] (03CR) 10Mforns: [V: 03+2] "OK! Not without difficulty I was able to test this works! :] I think it's ready for review." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/587270 (https://phabricator.wikimedia.org/T246706) (owner: 10Mforns) [19:51:48] (03CR) 10Awight: Only track unique users disabling TwoColConflict (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/587232 (https://phabricator.wikimedia.org/T247944) (owner: 10WMDE-Fisch) [20:32:21] bye all! [21:42:18] 10Analytics, 10Growth-Team, 10Product-Analytics: Growth: validate that data is purged after 270 days - https://phabricator.wikimedia.org/T249666 (10MMiller_WMF) [21:42:28] 10Analytics, 10Growth-Team, 10Product-Analytics, 10Patch-For-Review: Growth: implement wider data purge window - https://phabricator.wikimedia.org/T237124 (10MMiller_WMF) 05Open→03Resolved Thank you! Now that this is running, I filed {T249666} so that we remember to validate that the purging is happen... 
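For the record, roughly the shape of the invocation that ended up working from an-coord1001 — wrapped in Python only to keep the examples in one language; the principal, keytab path, class, and jar below are placeholders, not the real values:

    import subprocess

    # spark2-submit on YARN accepts --principal/--keytab so the job can
    # (re)authenticate itself, instead of relying on a cached kinit ticket
    # that may expire while the job runs.
    subprocess.run([
        "sudo", "-u", "analytics", "spark2-submit",
        "--master", "yarn",
        "--principal", "analytics/an-coord1001.eqiad.wmnet@WIKIMEDIA",   # placeholder
        "--keytab", "/etc/security/keytabs/analytics/analytics.keytab",  # placeholder
        "--class", "org.wikimedia.analytics.refinery.job.refine.Refine", # illustrative
        "refinery-job.jar",
    ], check=True)

Passing the credentials explicitly also addresses the caveat raised earlier in the day: a cached ticket is fine "provided that the ticket exists... unless the job runs for a long time".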
[22:09:35] PROBLEM - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:03:35] RECOVERY - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers