[07:09:42] !log manual reset-failed refinery-sqoop-whole-mediawiki.service on an-launcher1002 (job launched manually)
[07:09:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:13:00] RECOVERY - Check the last execution of refinery-sqoop-whole-mediawiki on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:19:51] Good morning
[07:19:57] Thanks elukey for the reset of the timer
[07:21:09] bonjour!
[07:23:50] elukey: I'm still having trouble sqooping the commonswiki image table :(
[07:24:16] I'm gonna leave it for now, trying to make progress on other projects, and then come back to it
[07:26:52] ack!
[07:40:26] Analytics, Analytics-Kanban: Investigate oozie banner monthly job timeouts - https://phabricator.wikimedia.org/T264358 (JAllemandou) From https://github.com/wikimedia/analytics-refinery/blob/master/oozie/banner_activity/druid/monthly/coordinator.xml#L52, the dataset dependency is set up to wait for both:...
[07:46:09] (CR) Joal: "2 nits marked as done but actually not done, rest looks good" (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777) (owner: Fdans)
[08:20:27] elukey: look at that!!! https://github.com/apache/incubator-gobblin/pull/3163
[08:22:22] ahhh nice!
[08:35:42] elukey: Thanks!
[08:36:11] thank you!
[08:37:26] "mediawiki_history for 2020-11 now available" ah!
[08:37:35] \o/
[08:37:35] joal: does it mean that I can mess with hive? :D
[08:38:03] elukey: if you could wait for mediawiki-history-reduced that'd be awesome elukey - I'm sorry to slow you down :(
[08:38:10] ahhh yes right
[08:38:17] nono but I am not ready yet
[08:43:11] when you have a moment I'll explain my crazy idea
[08:43:20] Please elukey :)
[08:44:01] I thought of splitting the hive-site.xml config into two sets:
[08:44:45] 1) clients - they will get the same analytics-hive.eqiad.wmnet for both metastore and server. More flexible; we lose the multi-metastore line etc., but some tools don't support it
[08:45:29] 2) an-coords - in this case, we can explicitly state in the hive-site config that there are multiple metastores, so (as I hope) the server will know how to leverage those in case it is needed
[08:45:47] say a metastore gets overloaded, the server will be able to try with the other one
[08:46:15] hm
[08:46:28] trying to speak my understanding
[08:46:34] actually, to write it
[08:47:16] still a single hive-site.xml, but different versions depending on host (clients vs an-coords)
[08:47:48] I think we already have that, as only the an-coords ones have the mysql pass for instance
[08:47:56] But it's a detail :)
[08:48:20] yep yep
[08:48:30] right
[08:49:11] Now the thing I wonder is: is there value in having an-coords have knowledge of the 2 metastores
[08:49:44] the idea that I have is that the hive server will be able to use both
[08:50:11] I hear that
[08:50:21] for example, last week when that test hammered one metastore
[08:50:39] maybe if we had the server aware of another metastore, it would not have impacted other jobs
[08:51:57] it is just to avoid a metastore sitting completely idle
[08:52:01] I get your point exactly - I wonder if the server is able to handle two metastores
[08:52:54] surely the support is best on 2.x, but I think there is something for 1.x too
[08:52:56] yeah I understand elukey - I am a bit afraid of active-active mode, but eh, we can try!
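For reference, the "multi-metastore line" discussed here is the hive.metastore.uris property, which accepts a comma-separated list of thrift URIs; whether a Hive 1.x server actually balances or fails over between the entries is exactly the open question in the conversation. A rough sketch of the two hive-site.xml variants being proposed, assuming the standard metastore port 9083:

```xml
<!-- an-coords variant (sketch): both metastores listed explicitly -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://an-coord1001.eqiad.wmnet:9083,thrift://an-coord1002.eqiad.wmnet:9083</value>
</property>

<!-- clients variant (sketch): single CNAME, resolving to whichever coordinator is active -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://analytics-hive.eqiad.wmnet:9083</value>
</property>
```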
[08:53:04] joal: why afraid?
[08:53:48] (trying to understand the problems)
[08:53:56] because it would be handy for example if I have to restart stuff
[08:54:25] elukey: concurrency problems - I do a heavy request on ms1 - it fails to respond fast because the request is heavy - I try it on ms2 - but the DB state has already been altered by the currently processed request on ms1 - problem
[08:55:30] joal: sure but you'll do it within a hive session that is shared, I hope that the use case is handled :D
[08:55:35] I assume that all that is covered by transactions at the metastore+db level, but the granularity of the transactions is kinda important depending on actions (I'm thinking of repair table here)
[08:56:21] https://usercontent.irccloud-cdn.com/file/RPlXzssu/image.png
[08:56:23] elukey: ^
[08:56:36] ahahahah
[08:56:39] Nice Amir1 :)
[08:57:13] a couple hundred cases and we are done
[08:57:25] ok joal I get that you are not fond of the change, I'll wait for hive 2.x then, will do the simple one for now
[08:58:28] elukey: Even if not nice in terms of usage, I'd still go with full CNAME for all (clients + coords), making it easier to understand and maintain :)
[08:58:44] Thank you elukey for your understanding :)
[09:03:04] joal: sigh ok
[09:03:49] * joal sends wikilove to elukey
[09:10:25] joal: I also have to test that the analytics-hive setting works for the metastore
[09:10:32] yessir
[09:10:35] since it is not stated in any tutorial
[09:27:46] elukey: can you refresh my memory of the kafka version we currently use please?
[09:28:31] joal: 1.1
[09:28:43] Thanks mate
[09:28:51] I thought it was higher (1.5 or something)
[09:28:53] but in the bright future, with the new test cluster that ra*zzi is building, we should be able to test 2.x
[09:29:03] ack!
[09:40:14] joal: is there a coordinator that uses spark actions that we can test in hadoop test?
[09:40:29] elukey: I updated one a while back
[09:40:37] elukey: let me try to find that
[09:42:07] elukey: sorry - more precise answer: the only spark action we can use in hadoop test is mobile-app session metrics (based on data present on the cluster)
[09:42:22] ah nice, I can look for it and try to start it
[09:42:30] elukey: However this coord doesn't make use of hive features in spark (no connection to metastore)
[09:42:42] ah then it is not useful
[09:43:04] elukey: therefore I updated a job last time we tested, to have a job that not only tests the spark action, but also spark with hive
[09:44:11] I however can't recall from where I did that :(
[09:45:27] elukey: I assume all hosts have been reimaged recently for the test cluster - were the /home folders crushed?
[09:45:56] joal: the old hosts have been decommed :(
[09:46:09] Ahhhhhh - of course !!!
[09:46:22] but I can come up with an oozie job that does a simple spark action, it shouldn't be hard
[09:46:46] ok - let me send a CR for an update of the mobile-app session job that will do nothing but a small test for us
[09:46:49] elukey: --^
[09:47:01] if you have time thanks :)
[09:47:39] elukey: I wish to get my elukey-karma back by setting up prioritize correctly :D
[10:00:49] (PS1) Joal: [TEST] Update mobile-app session job to test spark-hive [analytics/refinery/source] - https://gerrit.wikimedia.org/r/645043
[10:00:53] elukey: --^
[10:01:03] nice thanks!
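For context on the "simple spark action" idea above: a bare Oozie workflow with a spark action would look roughly like the sketch below. The workflow name, class, and jar variable are made up for illustration and are not the real refinery job; only the overall structure follows the Oozie spark-action schema.

```xml
<workflow-app name="spark-smoke-test" xmlns="uri:oozie:workflow:0.5">
    <start to="spark_test"/>
    <action name="spark_test">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${job_tracker}</job-tracker>
            <name-node>${name_node}</name-node>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>spark-smoke-test</name>
            <!-- class and jar are placeholders, not the actual refinery-job entry point -->
            <class>org.example.SparkHiveSmokeTest</class>
            <jar>${refinery_jar}</jar>
            <spark-opts>--conf spark.yarn.maxAppAttempts=1</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Spark test failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```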
[10:01:37] elukey: I can help with testing if you wish - or at least explain how I'd go about that
[10:01:52] so I need to rebuild source in test, upload to hdfs and then test
[10:02:01] ah yes sure
[10:02:31] elukey: Also - I wondered how we stand on new hosts being available - thinking of starting distcp for backup
[10:03:22] elukey: indeed - build refinery-source, upload the refinery-job jar to hdfs, test the oozie job with the parameter for the jar set to the newly updated jar
[10:31:59] ok so there might be some problem in using analytics-hive for the metastore too
[10:32:25] :(
[10:32:35] for example, say analytics-hive.eqiad.wmnet -> an-coord1001.eqiad.wmnet
[10:32:50] then the server on 1001 correctly connects to its metastore
[10:33:04] but on 1002 it does not, since it connects to 1001's
[10:33:34] so if 1001 goes down, then we fail over to 1002, but we need to restart the server to pick up the metastore on 1002
[10:33:37] and that is not ideal
[10:34:23] so on the coordinators the thrift uri should be an-coord100x.eqiad.wmnet
[10:34:37] meanwhile on the clients we keep analytics-hive.eqiad.wmnet
[10:43:21] It makes sense elukey
[10:45:55] credentials-wise it should be ok
[11:09:12] ok solved the problem in puppet, now we should be ready
[11:09:26] the first restart will be to add the DBTokenStore support
[11:09:34] then I'll add the metastore on 1002
[11:09:48] aaand after that we should be able to move all oozie jobs to analytics-hive
[11:10:04] and fail over anytime if needed
[11:10:18] (I'll wait for mw history reduced of course)
[11:12:19] --
[11:12:35] John (DCOps) and I are going to move other nodes from 10g to 1g racks
[11:12:55] I have on the list:
[11:12:58] druid1003
[11:13:02] aqs1006
[11:13:04] db1108
[11:13:16] druid1001
[11:13:32] joal: --^ as FYI
[11:29:13] starting with druid1001
[12:14:26] druid1001 done
[12:17:23] doing aqs1006
[12:34:21] ack elukey - sorry, was gone for the kids to eat
[12:34:40] nono I just wanted to let you know :)
[12:34:46] I am going to take care of the failed jobs
[12:34:58] as you wish elukey, I can help with that as well
[12:35:07] nono don't worry :)
[12:35:12] (Thanks anyway <3)
[12:52:49] now I am moving db1108, our dear backup host
[12:53:04] then I'll do druid1003
[13:53:45] Hi, ORES has been down for a couple of days ("server overloaded"). Do you know who should be informed?
[13:54:08] example query here: https://www.mediawiki.org/wiki/ORES#API_usage
[13:54:18] hi tizianop - the right person to ping would be chrisalbon, and probably klausman too
[13:55:04] will do. thanks
[13:56:08] tizianop: hi! are you hitting the wmflabs endpoint or the prod one?
[13:56:54] the one used in the example: https://ores.wmflabs.org/v3/scores/enwiki/?models=draftquality%7Cwp10&revids=34854345%7C485104318
[13:58:01] tizianop: yep it is the wmflabs one, not the production one.. try to swap ores.wmflabs.org with ores.wikimedia.org
[13:58:18] https://ores.wikimedia.org/
[13:58:30] nice catch elukey!
[13:58:52] I am not sure what the policies are for using the prod one though..
[13:59:11] tizianop: if you are planning to use a script to pull things please read https://meta.wikimedia.org/wiki/User-Agent_policy
[13:59:19] it works \o/
[14:00:25] :)
[14:00:47] Having ORES down like the cloud/labs endpoint would mean half of the Wikipedians contacting SRE :D
[14:01:50] thank you!
[14:19:42] Also of note: all I know of ORES is its rough purpose, and that it's a bit of a maintenance nightmare. Chris would definitely be the person to poke for problems with it, as he knows who knows etc.
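To expand on the endpoint swap and the User-Agent policy link above, a minimal Python sketch of the same example query against the production endpoint, with an identifying User-Agent as the policy asks for. The header value is only a placeholder; substitute your own tool name and contact address.

```python
import requests

# Production endpoint; ores.wmflabs.org (used in the wiki example) is a separate, best-effort instance.
URL = "https://ores.wikimedia.org/v3/scores/enwiki/"

# Per https://meta.wikimedia.org/wiki/User-Agent_policy: identify the tool and give a contact.
HEADERS = {"User-Agent": "my-research-script/0.1 (someone@example.org)"}

# Models and revision ids taken from the example query discussed above.
params = {"models": "draftquality|wp10", "revids": "34854345|485104318"}

resp = requests.get(URL, params=params, headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.json())
```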
[14:23:38] * elukey reads all excuses, blame Tobias for ORES! :P
[14:26:40] klausman: I have a potentially disruptive and dangerous rack move to do in 10 mins more or less (with John, DCOps)
[14:26:48] if you want I can explain what I am doing
[14:36:45] ottomata: I was very wrong indeed - Looks like you can add jars at spark-session creation!
[14:43:46] elukey: yes!
[14:44:07] klausman: bc!
[14:44:11] Sorry, I was staring at Kafka streams trying to divine the future :)
[14:46:45] Analytics: Can't use custom conda kernel in Newpyter within PySpark UDFs - https://phabricator.wikimedia.org/T269358 (Isaac)
[15:01:00] heya teammm
[15:15:52] Analytics, Analytics-Kanban: Investigate oozie banner monthly job timeouts - https://phabricator.wikimedia.org/T264358 (Ottomata) > The problem we were having was that the jobs were timing out before execution time due to the long delay (longer than 60 days), leading to success-files being deleted for the...
[15:36:59] elukey: the supercomputer I was thinking about isn't in Italy, but Spain: https://www.bsc.es/marenostrum/marenostrum
[15:37:25] https://www.bsc.es/sites/default/files/public/gallery/2017_bsc_superordenador_marenostrum-4_barcelona-supercomputing-center.jpg Fancy as hell
[15:37:30] wow
[15:37:58] yes simply amazing
[15:38:06] But I could see an Italian university having similar aesthetics for their computing centers :)
[15:38:36] Pisa was one of the first universities with supercomputers IIRC so I thought it could have been possible
[15:39:46] The leaning bigtower of Pisa :)
[15:52:06] razzi: o/ - I have restarted memcached on an-tool1010, it was in a weird state (saw an alarm on icinga)
[15:58:40] razzi: also, if you want we can go through https://gerrit.wikimedia.org/r/c/operations/puppet/+/644962 later on so I can explain how it works
[16:02:56] elukey: thanks for that restart
[16:03:33] elukey: I have some time now, want to hop on a call?
[16:03:38] razzi: sure
[16:14:37] ORES isn't down, I know this because I didn't wake up to 10000 pings on my phone
[16:16:23] And terror deep in my heart
[16:20:47] chrisalbon: if ORES ever malfunctions and starts marking all edits as vandalism you should call it ARES (_abjective_ revision evaluation service) which also works because Ares "represents the violent and untamed aspect of war"
[16:21:14] just throwin' that out there
[16:21:24] approved
[16:22:27] O tempora, o ORES.
[16:24:30] who's brave enough to add klausman to the cultural references section of https://en.wikipedia.org/wiki/O_tempora,_o_mores!
[16:27:24] hey a-team, let's skip standup and grosking in favor of the monthly staff meeting
[16:27:42] ok
[16:36:37] (PS4) Fdans: Add historical_raw job to load data from pagecounts_raw [analytics/refinery] - https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777)
[16:37:23] (CR) Fdans: "Sorry, it seems I hadn't hit save before committing changes" (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777) (owner: Fdans)
[16:40:05] ok, no standup means we do a virtual ops week handoff. razzi/ottomata: let me know if there's anything to know, alarms you didn't handle etc. I can do the mw history snapshot upgrade in aqs
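Picking up the 14:36 note above that jars can be attached when the Spark session is created: a minimal PySpark sketch of that, assuming a placeholder jar path. Because spark.jars has to be set before the context exists, it goes on the session builder rather than being added afterwards.

```python
from pyspark.sql import SparkSession

# Placeholder path: point this at the actual jar on HDFS (or a local path).
EXTRA_JAR = "hdfs:///tmp/refinery-job-test.jar"

spark = (
    SparkSession.builder
    .appName("jar-at-session-creation-test")
    .config("spark.jars", EXTRA_JAR)   # shipped to driver and executors at startup
    .enableHiveSupport()               # only needed if the job talks to the Hive metastore
    .getOrCreate()
)
```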
[16:54:04] milimetric: that was the only thing i was going to ask about
[16:54:19] milimetric: maybe razzi and i can do that with you, i've never done it
[16:55:53] Analytics, Analytics-Kanban, Patch-For-Review: Add folder creation for sqoop initial installation in puppet - https://phabricator.wikimedia.org/T251788 (Ottomata)
[16:56:13] milimetric: i moved some tasks out of ops week that have owners and no actionables
[16:56:19] the 2 that are left i'm not sure about
[16:56:25] ping fdans and joal about those
[16:56:31] should they be moved out of ops week?
[16:56:45] looking
[16:56:52] ottomata: ok, great, I'll take a look too and we can do the datasource thing anytime, I guess after the all hands
[16:58:01] ottomata: yea mine is not really an ops week task
[16:58:49] i'm going to move it to blocked
[17:01:26] milimetric: can we do the mw history aqs upgrade sometime between 2 and 4?
[17:22:37] Analytics: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (elukey) This will be an analytics goal for next quarter :)
[17:42:36] Analytics, Product-Infrastructure-Team-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (Mholloway)
[17:57:41] Analytics, Analytics-Kanban: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (elukey) Pinging people in here too: @Catrope @dduvall @cooltey @MarkTraceur @Jhernandez @dchen @Etonkovidova @Legoktm @matmarex @DStrine @debt @Sharvaniharan Email sent: >Hi! >If you are r...
[18:08:49] going afk people o/
[18:17:06] Analytics, Analytics-Kanban: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (DStrine) I'm not sure what this is but I'm pretty sure I don't use this. Thanks for the ping.
[18:17:18] ottomata: let's shoot for 14:35 EST, cc razzi
[18:17:31] milimetric: sg
[18:25:40] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (Isaac) Additional data to hopefully help see the impact of the different privacy unique actor thresholds would have on wh...
[18:50:57] can do
[18:51:00] perfect
[19:35:59] milimetric: razzi aqs?
[19:36:36] ottomata: yeah, tardis?
[19:38:10] need tardis link...and need milimetric :)
[19:38:25] I can help with the tardis link: https://meet.google.com/kti-iybt-ekv
[19:38:43] hey, omw there, sorry
[20:41:21] Analytics, Product-Analytics, Product-Infrastructure-Team-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (LGoto)
[22:11:51] Analytics, Product-Analytics, Product-Infrastructure-Team-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (JMinor) Looks like the monthly data for November is now up and working. There is a lag as the month turns over, and the processing finis...
[22:34:18] !log updated mw history snapshot on AQS
[22:34:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:38:08] Analytics, Product-Analytics, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (LGoto)