[07:09:42] !log manual reset-failed refinery-sqoop-whole-mediawiki.service on an-launcher1002 (job launched manually)
[07:09:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:13:00] RECOVERY - Check the last execution of refinery-sqoop-whole-mediawiki on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:19:51] Good morning
[07:19:57] Thanks elukey for the reset of the timer
[07:21:09] bonjour!
[07:23:50] elukey: I'm still having trouble sqooping the commonswiki image table :(
[07:24:16] I'm gonna leave it for now, trying to make progress on other projects, and then come back to it
[07:26:52] ack!
[07:40:26] Analytics, Analytics-Kanban: Investigate oozie banner monthly job timeouts - https://phabricator.wikimedia.org/T264358 (JAllemandou) From https://github.com/wikimedia/analytics-refinery/blob/master/oozie/banner_activity/druid/monthly/coordinator.xml#L52, the dataset dependency is set up to wait for both:...
[07:46:09] (CR) Joal: "2 nits marked as done but actually not done, rest looks good" (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777) (owner: Fdans)
[08:20:27] elukey: look at that!!! https://github.com/apache/incubator-gobblin/pull/3163
[08:22:22] ahhh nice!
[08:35:42] elukey: Thanks!
[08:36:11] thank you!
[08:37:26] "mediawiki_history for 2020-11 now available" ah!
[08:37:35] \o/
[08:37:35] joal: does it mean that I can mess with hive? :D
[08:38:03] elukey: if you could wait for mediawiki-history-reduced that'd be awesome elukey - I'm sorry to slow you down :(
[08:38:10] ahhh yes right
[08:38:17] nono but I am not ready yet
[08:43:11] when you have a moment I'll explain my crazy idea
[08:43:20] Please elukey :)
[08:44:01] I thought of splitting the hive-site.xml config into two sets:
[08:44:45] 1) clients - they will get the same analytics-hive.eqiad.wmnet for both metastore and server. More flexible; we lose the multi-metastore line etc., but some tools don't support it
[08:45:29] 2) an-coords - in this case, we can explicitly state in the hive-site config that there are multiple metastores, so (as I hope) the server will know how to leverage those in case it is needed
[08:45:47] say a metastore gets overloaded, the server will be able to try with the other one
[08:46:15] hm
[08:46:28] trying to speak my understanding
[08:46:34] actually, to write it
[08:47:16] still a single hive-site.xml, but different versions depending on host (clients vs an-coords)
[08:47:48] I think we already have that, as only the an-coords ones have the mysql pass for instance
[08:47:56] But it's a detail :)
[08:48:20] yep yep
[08:48:30] right
[08:49:11] Now the thing I wonder is: is there value in having an-coords have knowledge of the 2 metastores
[08:49:44] the idea that I have is that the hive server will be able to use both
[08:50:11] I hear that
[08:50:21] for example, last week when that test hammered one metastore
[08:50:39] maybe if we had the server aware of another metastore, it would not have impacted other jobs
[08:51:57] it is just to avoid a metastore sitting completely idle
[08:52:01] I get your point exactly - I wonder if the server is able to handle two metastores
[08:52:54] surely the support is best on 2.x, but I think there is something for 1.x too
[08:52:56] yeah I understand elukey - I am a bit afraid of active-active mode, but eh, we can try!
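For reference, the "multi-metastore line" discussed here is the hive.metastore.uris property, which accepts a comma-separated list of thrift URIs; whether a Hive 1.x server actually balances or fails over between the entries is exactly the open question in the conversation. A rough sketch of the two hive-site.xml variants being proposed, assuming the standard metastore port 9083:

```xml
<!-- an-coords variant (sketch): both metastores listed explicitly -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://an-coord1001.eqiad.wmnet:9083,thrift://an-coord1002.eqiad.wmnet:9083</value>
</property>

<!-- clients variant (sketch): single CNAME, resolving to whichever coordinator is active -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://analytics-hive.eqiad.wmnet:9083</value>
</property>
```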
[08:53:04] joal: why afraid?
[08:53:48] (trying to understand the problems)
[08:53:56] because it would be handy for example if I have to restart stuff
[08:54:25] elukey: concurrency problems - I do a heavy request on ms1 - it fails to respond fast because the request is heavy - I try it on ms2 - but the DB state has already been altered by the currently processed request on ms1 - problem
[08:55:30] joal: sure but you'll do it within a hive session that is shared, I hope that the use case is handled :D
[08:55:35] I assume that all that is covered by transactions at the metastore+db level, but the granularity of the transactions is kinda important depending on actions (I'm thinking of repair table here)
[08:56:21] https://usercontent.irccloud-cdn.com/file/RPlXzssu/image.png
[08:56:23] elukey: ^
[08:56:36] ahahahah
[08:56:39] Nice Amir1 :)
[08:57:13] a couple hundred cases and we are done
[08:57:25] ok joal I get that you are not fond of the change, I'll wait for hive 2.x then, will do the simple one for now
[08:58:28] elukey: Even if not nice in terms of usage, I'd still go with full CNAME for all (clients + coords), making it easier to understand and maintain :)
[08:58:44] Thank you elukey for your understanding :)
[09:03:04] joal: sigh ok
[09:03:49] * joal sends wikilove to elukey
[09:10:25] joal: I also have to test that the analytics-hive setting works for the metastore
[09:10:32] yessir
[09:10:35] since it is not stated in any tutorial
[09:27:46] elukey: can you refresh my memory of the kafka version we currently use please?
[09:28:31] joal: 1.1
[09:28:43] Thanks mate
[09:28:51] I thought it was higher (1.5 or something)
[09:28:53] but in the bright future, with the new test cluster that ra*zzi is building, we should be able to test 2.x
[09:29:03] ack!
[09:40:14] joal: is there a coordinator that uses spark actions that we can test in hadoop test?
[09:40:29] elukey: I updated one a while back
[09:40:37] elukey: let me try to find that
[09:42:07] elukey: sorry - more precise answer: the only spark action we can use in hadoop test is mobile-app session metrics (based on data present on the cluster)
[09:42:22] ah nice, I can look for it and try to start it
[09:42:30] elukey: However this coord doesn't make use of hive features in spark (no connection to metastore)
[09:42:42] ah then it is not useful
[09:43:04] elukey: therefore I updated a job last time we tested, to have a job that not only tests the spark action, but also spark with hive
[09:44:11] I however can't recall from where I did that :(
[09:45:27] elukey: I assume all hosts have been reimaged recently for the test cluster - were the /home folders crushed?
[09:45:56] joal: the old hosts have been decommed :(
[09:46:09] Ahhhhhh - of course !!!
[09:46:22] but I can come up with an oozie job that does a simple spark action, it shouldn't be hard
[09:46:46] ok - let me send a CR for an update of the mobile-app session job that will do nothing but a small test for us
[09:46:49] elukey: --^
[09:47:01] if you have time thanks :)
[09:47:39] elukey: I wish to get my elukey-karma back by setting up prioritize correctly :D
[10:00:49] (PS1) Joal: [TEST] Update mobile-app session job to test spark-hive [analytics/refinery/source] - https://gerrit.wikimedia.org/r/645043
[10:00:53] elukey: --^
[10:01:03] nice thanks!
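For context on the "simple spark action" idea above: a bare Oozie workflow with a spark action would look roughly like the sketch below. The workflow name, class, and jar variable are made up for illustration and are not the real refinery job; only the overall structure follows the Oozie spark-action schema.

```xml
<workflow-app name="spark-smoke-test" xmlns="uri:oozie:workflow:0.5">
    <start to="spark_test"/>
    <action name="spark_test">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${job_tracker}</job-tracker>
            <name-node>${name_node}</name-node>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>spark-smoke-test</name>
            <!-- class and jar are placeholders, not the actual refinery-job entry point -->
            <class>org.example.SparkHiveSmokeTest</class>
            <jar>${refinery_jar}</jar>
            <spark-opts>--conf spark.yarn.maxAppAttempts=1</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Spark test failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```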
[10:01:37] elukey: I can help with testing if you wish - or at least explain how I'd go about that
[10:01:52] so I need to rebuild source in test, upload to hdfs and then test
[10:02:01] ah yes sure
[10:02:31] elukey: Also - I wondered how we stand on new hosts being available - thinking of starting distcp for backup
[10:03:22] elukey: indeed - build refinery-source, upload the refinery-job jar to hdfs, test the oozie job with the parameter for the jar set to the newly updated jar
[10:31:59] ok so there might be some problem in using analytics-hive for the metastore too
[10:32:25] :(
[10:32:35] for example, say analytics-hive.eqiad.wmnet -> an-coord1001.eqiad.wmnet
[10:32:50] then the server on 1001 correctly connects to its metastore
[10:33:04] but on 1002 it does not, since it connects to 1001's
[10:33:34] so if 1001 goes down, then we fail over to 1002, but we need to restart the server to pick up the metastore on 1002
[10:33:37] and that is not ideal
[10:34:23] so on the coordinators the thrift uri should be an-coord100x.eqiad.wmnet
[10:34:37] meanwhile on the clients we keep analytics-hive.eqiad.wmnet
[10:43:21] It makes sense elukey
[10:45:55] credentials-wise it should be ok
[11:09:12] ok solved the problem in puppet, now we should be ready
[11:09:26] the first restart will be to add the DBTokenStore support
[11:09:34] then I'll add the metastore on 1002
[11:09:48] aaand after that we should be able to move all oozie jobs to analytics-hive
[11:10:04] and fail over anytime if needed
[11:10:18] (I'll wait for mw history reduced of course)
[11:12:19] --
[11:12:35] John (DCOps) and I are going to move other nodes from 10g to 1g racks
[11:12:55] I have on the list:
[11:12:58] druid1003
[11:13:02] aqs1006
[11:13:04] db1108
[11:13:16] druid1001
[11:13:32] joal: --^ as FYI
[11:29:13] starting with druid1001
[12:14:26] druid1001 done
[12:17:23] doing aqs1006
[12:34:21] ack elukey - sorry, was gone for the kids to eat
[12:34:40] nono I just wanted to let you know :)
[12:34:46] I am going to take care of the failed jobs
[12:34:58] as you wish elukey, I can help with that as well
[12:35:07] nono don't worry :)
[12:35:12] (Thanks anyway <3)
[12:52:49] now I am moving db1108, our dear backup host
[12:53:04] then I'll do druid1003
[13:53:45] Hi, ORES has been down for a couple of days ("server overloaded"). Do you know who should be informed?
[13:54:08] example query here: https://www.mediawiki.org/wiki/ORES#API_usage
[13:54:18] hi tizianop - the right person to ping would be chrisalbon, and probably klausman too
[13:55:04] will do. thanks
[13:56:08] tizianop: hi! are you hitting the wmflabs endpoint or the prod one?
[13:56:54] the one used in the example: https://ores.wmflabs.org/v3/scores/enwiki/?models=draftquality%7Cwp10&revids=34854345%7C485104318
[13:58:01] tizianop: yep it is the wmflabs one, not the production one.. try to swap ores.wmflabs.org with ores.wikimedia.org
[13:58:18] https://ores.wikimedia.org/
[13:58:30] nice catch elukey!
[13:58:52] I am not sure what the policies are for using the prod one though..
[13:59:11] tizianop: if you are planning to use a script to pull things please read https://meta.wikimedia.org/wiki/User-Agent_policy
[13:59:19] it works \o/
[14:00:25] :)
[14:00:47] Having ORES down like the cloud/labs endpoint would mean half of the Wikipedians contacting SRE :D
[14:01:50] thank you!
[14:19:42] Also of note: all I know of ORES is its rough purpose, and that it's a bit of a maintenance nightmare. Chris would definitely be the person to poke for problems with it, as he knows who knows etc.
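To expand on the endpoint swap and the User-Agent policy link above, a minimal Python sketch of the same example query against the production endpoint, with an identifying User-Agent as the policy asks for. The header value is only a placeholder; substitute your own tool name and contact address.

```python
import requests

# Production endpoint; ores.wmflabs.org (used in the wiki example) is a separate, best-effort instance.
URL = "https://ores.wikimedia.org/v3/scores/enwiki/"

# Per https://meta.wikimedia.org/wiki/User-Agent_policy: identify the tool and give a contact.
HEADERS = {"User-Agent": "my-research-script/0.1 (someone@example.org)"}

# Models and revision ids taken from the example query discussed above.
params = {"models": "draftquality|wp10", "revids": "34854345|485104318"}

resp = requests.get(URL, params=params, headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.json())
```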
[14:23:38] * elukey reads all excuses, blame Tobias for ORES! :P
[14:26:40] klausman: I have a potentially disruptive and dangerous rack move to do in 10 mins more or less (with John, DCOps)
[14:26:48] if you want I can explain what I am doing
[14:36:45] ottomata: I was very wrong indeed - Looks like you can add jars at spark-session creation!
[14:43:46] elukey: yes!
[14:44:07] klausman: bc!
[14:44:11] Sorry, I was staring at Kafka streams trying to divine the future :)
[14:46:45] Analytics: Can't use custom conda kernel in Newpyter within PySpark UDFs - https://phabricator.wikimedia.org/T269358 (Isaac)
[15:01:00] heya teammm
[15:15:52] Analytics, Analytics-Kanban: Investigate oozie banner monthly job timeouts - https://phabricator.wikimedia.org/T264358 (Ottomata) > The problem we were having was that the jobs were timing out before execution time due to the long delay (longer than 60 days), leading to success-files being deleted for the...
[15:36:59] elukey: the supercomputer I was thinking about isn't in Italy, but Spain: https://www.bsc.es/marenostrum/marenostrum
[15:37:25] https://www.bsc.es/sites/default/files/public/gallery/2017_bsc_superordenador_marenostrum-4_barcelona-supercomputing-center.jpg Fancy as hell
[15:37:30] wow
[15:37:58] yes simply amazing
[15:38:06] But I could see an Italian university having similar aesthetics for their computing centers :)
[15:38:36] Pisa was one of the first universities with supercomputers IIRC so I thought it could have been possible
[15:39:46] The leaning bigtower of Pisa :)
[15:52:06] razzi: o/ - I have restarted memcached on an-tool1010, it was in a weird state (saw an alarm on icinga)
[15:58:40] razzi: also, if you want we can go through https://gerrit.wikimedia.org/r/c/operations/puppet/+/644962 later on so I can explain how it works
[16:02:56] elukey: thanks for that restart
[16:03:33] elukey: I have some time now, want to hop on a call?
[16:03:38] razzi: sure
[16:14:37] ORES isn't down, I know this because I didn't wake up to 10000 pings on my phone
[16:16:23] And terror deep in my heart
[16:20:47] chrisalbon: if ORES ever malfunctions and starts marking all edits as vandalism you should call it ARES (_abjective_ revision evaluation service) which also works because Ares "represents the violent and untamed aspect of war"
[16:21:14] just throwin' that out there
[16:21:24] approved
[16:22:27] O tempora, o ORES.
[16:24:30] who's brave enough to add klausman to the cultural references section of https://en.wikipedia.org/wiki/O_tempora,_o_mores!
[16:27:24] hey a-team, let's skip standup and grosking in favor of the monthly staff meeting
[16:27:42] ok
[16:36:37] (PS4) Fdans: Add historical_raw job to load data from pagecounts_raw [analytics/refinery] - https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777)
[16:37:23] (CR) Fdans: "Sorry, it seems I hadn't hit save before committing changes" (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777) (owner: Fdans)
[16:40:05] ok, no standup means we do a virtual ops week handoff. razzi/ottomata: let me know if there's anything to know, alarms you didn't handle etc. I can do the mw history snapshot upgrade in aqs
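Picking up the 14:36 note above that jars can be attached when the Spark session is created: a minimal PySpark sketch of that, assuming a placeholder jar path. Because spark.jars has to be set before the context exists, it goes on the session builder rather than being added afterwards.

```python
from pyspark.sql import SparkSession

# Placeholder path: point this at the actual jar on HDFS (or a local path).
EXTRA_JAR = "hdfs:///tmp/refinery-job-test.jar"

spark = (
    SparkSession.builder
    .appName("jar-at-session-creation-test")
    .config("spark.jars", EXTRA_JAR)   # shipped to driver and executors at startup
    .enableHiveSupport()               # only needed if the job talks to the Hive metastore
    .getOrCreate()
)
```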
[16:54:04] milimetric: that was the only thing i was going to ask about
[16:54:19] milimetric: maybe razzi and i can do that with you, i've never done it
[16:55:53] Analytics, Analytics-Kanban, Patch-For-Review: Add folder creation for sqoop initial installation in puppet - https://phabricator.wikimedia.org/T251788 (Ottomata)
[16:56:13] milimetric: i moved some tasks out of ops week that have owners and no actionables
[16:56:19] the 2 that are left i'm not sure about
[16:56:25] ping fdans and joal about those
[16:56:31] should they be moved out of ops week?
[16:56:45] looking
[16:56:52] ottomata: ok, great, I'll take a look too and we can do the datasource thing anytime, I guess after the all hands
[16:58:01] ottomata: yea mine is not really an ops week task
[16:58:49] i'm going to move it to blocked
[17:01:26] milimetric: can we do the mw history aqs upgrade sometime between 2 and 4?
[17:22:37] Analytics: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (elukey) This will be an analytics goal for next quarter :)
[17:42:36] Analytics, Product-Infrastructure-Team-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (Mholloway)
[17:57:41] Analytics, Analytics-Kanban: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (elukey) Pinging people in here too: @Catrope @dduvall @cooltey @MarkTraceur @Jhernandez @dchen @Etonkovidova @Legoktm @matmarex @DStrine @debt @Sharvaniharan Email sent: >Hi! >If you are r...
[18:08:49] going afk people o/
[18:17:06] Analytics, Analytics-Kanban: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (DStrine) I'm not sure what this is but I'm pretty sure I don't use this. Thanks for the ping.
[18:17:18] ottomata: let's shoot for 14:35 EST, cc razzi
[18:17:31] milimetric: sg
[18:25:40] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Isaac) Additional data to hopefully help see the impact of the different privacy unique actor thresholds would have on wh...
[18:50:57] can do
[18:51:00] perfect
[19:35:59] milimetric: razzi aqs?
[19:36:36] ottomata: yeah, tardis?
[19:38:10] need tardis link...and need milimetric :)
[19:38:25] I can help with the tardis link: https://meet.google.com/kti-iybt-ekv
[19:38:43] hey, omw there, sorry
[20:41:21] Analytics, Product-Analytics, Product-Infrastructure-Team-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (LGoto)
[22:11:51] Analytics, Product-Analytics, Product-Infrastructure-Team-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (JMinor) Looks like the monthly data for November is now up and working. There is a lag as the month turns over, and the processing finis...
[22:34:18] !log updated mw history snapshot on AQS
[22:34:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:38:08] Analytics, Product-Analytics, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (LGoto)