[00:01:01] (03CR) 10Nuria: [C: 032] Defer to the config to specify the area [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/464906 (https://phabricator.wikimedia.org/T188792) (owner: 10Milimetric) [00:05:33] (03Merged) 10jenkins-bot: Defer to the config to specify the area [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/464906 (https://phabricator.wikimedia.org/T188792) (owner: 10Milimetric) [01:45:39] 10Quarry: Include query execution time - https://phabricator.wikimedia.org/T126888 (10Huji) The only advantage would be to know if queries are getting stuck in "pending" mode for too long. It used to be an issue a while back but hasn't been for a long time. [02:41:13] 10Quarry: Include query execution time - https://phabricator.wikimedia.org/T126888 (10zhuyifei1999) It was probably {T172143}. It would stay pending anyhow. [03:27:09] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10Tbayer) @mforns Great to hear that Druid [[https://phabricator.wikimedia.org/T201873#4633754 |already allows]]... [03:36:00] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10Tbayer) Another question: It seems that the dimensions lack e.g. `Ua Browser Major` and other user agent deriv... [03:48:58] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10Tbayer) BTW, I understand we are focusing on use in Turnilo for now, but out of curiosity (and considering the... 
[04:07:27] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10Tbayer) Back to the view in Turnilo: This looks very exciting indeed! I have to mention that @ovasileva and I... [07:14:23] !log stopped all crons on analytics1003 as prep step for migration to an-coord1001 [07:14:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:14:26] morning :) [07:14:36] Hi elukey :) Investigating the error on MWH-reduced [07:36:58] joal: o/ - I am checking the refine failure, as Nuria mentioned it seems that it fails to allocate direct memory [07:37:11] rings a bell? Do we have a specific config for it? [07:37:46] elukey: "direct memory" is not really a term I have heard of so far for Spark - interested to understand more [07:38:51] so the stack trace doesn't really mention spark [07:38:54] but I can see [07:38:54] 18/10/08 17:20:50 ERROR RetryingBlockFetcher: Failed to fetch block shuffle_664_1_2, and will not retry (0 retries) [07:38:57] io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1006632960, max: 1012924416) [07:39:46] elukey: shuffle issue from what I see - Could be related to dynamic allocation :( [07:42:32] it is interesting because it says that it already allocated 1G of direct memory [07:42:46] so we might have a specific setting for it? 
[07:42:52] or a default that we don't tune [07:44:22] the other thing is that https://yarn.wikimedia.org/cluster/app/application_1538849321221_6306 seems succeeded [07:44:44] ah yes but the failed refinement is separate [07:44:49] okok nevermind [07:45:17] elukey: the global-refine-job has plenty small refinements (per schema) and tracks the failed ones [07:46:40] ahh okok [07:46:46] other n00b question if I may [07:47:00] the docs says that I should find a _REFINE_FAILURE flag [07:47:00] elukey@analytics1003:~$ ls /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_ReadingDepth/hourly/2018/10/08/15 [07:47:04] eventlogging_ReadingDepth.1002.0.340743.901450646.1539010800000 eventlogging_ReadingDepth.1002.0.3652632.905111487.1539010800000 [07:47:07] but there isn't [07:47:26] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Tbayer) >>! In T153821#2962822, @Nuria wrote: > @Krenair if wikitech is not behing varnish pageviews cannot be collected. Correct. Seem... [07:49:15] elukey: hm - I assume it would be related to the type of job failure, but I'm not sure [07:50:15] JOSEPH IS NOT SURE? [07:50:46] joal: you are always sure with the right answer, don't fool me [07:50:48] :D [07:50:59] meh :) [07:51:12] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10JAllemandou) `wikitech` is not part of the projects to account for in PageviewDefinition code (https://github.com/wikimedia/analytics-re... [07:58:38] there is something weird, even from the logs on an1003 I can see success [07:59:36] hm [07:59:50] Maybe Nuria's attempt fixed it [08:00:19] anyhow, it is 10 CEST [08:00:39] BRACE YOURSELF! an1003 will explode ! 
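Editor's aside: the numbers in the `OutOfDirectMemoryError` above are self-consistent, which is worth checking when debugging this class of failure — the 16 MiB chunk netty tried to allocate really would have pushed usage past the cap, and the cap of 1012924416 bytes is exactly 966 MiB (netty's direct-memory limit normally tracks the JVM's max direct memory, which is what the "do we tune it or is it a default" question is getting at). A minimal sketch of the arithmetic, using only the values copied from the stack trace:

```shell
# Values copied verbatim from the RetryingBlockFetcher stack trace above.
used=1006632960   # direct memory already allocated
req=16777216      # the 16 MiB chunk netty tried to allocate
max=1012924416    # netty's direct-memory cap

echo $(( used + req ))          # 1023410176 — what the allocation needed, > max
echo $(( max / 1024 / 1024 ))   # 966 — the cap in MiB
```

So the allocation was doomed by roughly 10 MiB, not by a pathological request size.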
[08:01:00] ahhaha [08:01:11] jobs still running, going to amend my puppet patch [08:01:21] (in the meantime while we wait) [08:04:18] elukey: hdfs dfs -ls hdfs://analytics-hadoop/wmf/data/event/ReadingDepth/year=2018/month=10/day=8/hour=15 [08:05:22] ah I checked raw! [08:05:35] my bad, thanks :) [08:05:45] I just realized it now [08:08:26] there is likely some event that is problematic [08:10:27] hm [08:11:27] anyway, it can wait a bit :) [08:11:55] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Krenair) >>! In T153821#4651184, @Tbayer wrote: >>>! In T153821#2962822, @Nuria wrote: >> @Krenair if wikitech is not behing varnish pag... [08:14:35] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Krenair) >>! In T153821#4651186, @JAllemandou wrote: > `wikitech` is not part of the projects to account for in PageviewDefinition code... [08:22:23] I think that while we wait we can move superset/hue to an-coord1001 [08:22:48] +1 elukey [08:23:13] elukey: I alos think the currently running jobs shouldn't prevent us from moving [08:25:49] ok so I am currently updating the puppet compiler with the new hosts (an-master/coord) so I'll be able to check the puppet patch [08:26:59] the patch is this one https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461997/ [08:28:10] so my idea is simple: stop oozie, stop hive (metastore/server2), stop hue, stop superset. Dump their databases, and copy them to an-coord1001 [08:28:15] then run puppet [08:28:19] and restart those daemons [08:28:57] we could stop superset and hue first, the oozie, then hive :) [08:29:10] The rest sounds good :) [08:30:31] yep [08:30:39] so superset/hue stopped, dumping databases [08:34:05] all right databases imported [08:34:23] joal: good to stop oozie/hive? 
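Editor's aside: the migration plan agreed above (stop superset/hue first, then oozie, then hive; dump each daemon's database; import on an-coord1001; run puppet; restart) can be sketched as a runbook. Hostnames are from the log; the service names, database names, and paths are assumptions — and the sketch only echoes each command rather than running anything:

```shell
OLD=analytics1003.eqiad.wmnet
NEW=an-coord1001.eqiad.wmnet

# Stop order agreed in the log: superset and hue first, then oozie, then hive.
for svc in superset hue oozie hive-metastore hive-server2; do
    echo "ssh $OLD sudo service $svc stop"
done

# Database names are assumed; the log only confirms each daemon has one.
for db in superset hue oozie hive_metastore; do
    echo "ssh $OLD 'mysqldump $db | gzip > /tmp/${db}.sql.gz'"
    echo "scp $OLD:/tmp/${db}.sql.gz $NEW:/tmp/"
    echo "ssh $NEW 'zcat /tmp/${db}.sql.gz | mysql $db'"
done

# Then apply the puppet change and restart the daemons in reverse order.
echo "ssh $NEW sudo run-puppet-agent"
```

As the rest of the log shows, the dump/import itself went smoothly (oozie's ~581M dump being the slow part); the follow-on breakage was all stale references to the old host.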
[08:34:33] elukey: +1 ! [08:36:16] stopped, dumping databases [08:38:40] oozie's db seems to take more than the others [08:39:19] I'm not suprised - Every hive-partition has multiple-steps-jobs to generate them [08:40:27] 581M oozie_09102018.sql.gz [08:40:28] :D [08:41:40] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10JAllemandou) > Why is this not documented on the wiki creation page? I don't underdstand what the 'wiki creation page' is, but I think t... [08:43:32] loading the databases now [08:43:38] (on an-coord1001 [08:46:13] ok joal merging the patches [08:46:15] *patch [08:46:18] k elukey [08:46:23] will start with hue and superset [08:46:30] why ? [08:46:49] nevermind -- [08:47:12] I am still importing the oozie/hive databases :) [08:47:19] that's a good reason :)p [08:51:46] doing hive and oozie [08:56:54] I don't see any connection to mysql on analytics1003 anymore, keeping it monitored [08:57:18] elukey: I think currently running job will possibly do [09:00:32] oozie is up [09:00:49] same thing for hue and superset [09:00:56] aaand also hive [09:01:03] now it is a matter of testing them :) [09:02:08] elukey: hue tells me oozie is happy (so far) - it seems to have recovered the running job [09:03:39] elukey: the hive-query-editor in hue gives me an error: 10:57:18 < joal> elukey: I think currently running job will possibly do [09:03:42] 11:00:32 <@elukey> oozie is up [09:03:50] mwarf - wrong copy paste sorry [09:03:55] Failed to open new session: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient [09:04:30] elukey: same on spark from stat1004 [09:04:46] elukey: has puppet run on the statmachineS? 
[09:05:06] yes yes, but hive-server seems complaining [09:05:14] :( [09:10:10] joal: now it should work, it needed a bit of encouragement [09:10:19] currently testing spark, seems ok [09:10:35] the status of the init.d scripts for hive/oozie is embarassing [09:10:45] :S [09:11:04] elukey: we should look at big top before commiting to that, should we? [09:11:31] elukey: spark happy, hive-query-editor in hue happy [09:11:32] joal: I strongly suspect that those are the same [09:11:39] good :) [09:12:41] elukey: are the drons restarted from an-coord01? [09:13:00] nope, it is still cron-less :) [09:13:42] k [09:13:54] Shall we move on that? [09:14:56] I am preparing the code change now [09:14:59] k [09:18:21] currently deploying all (including refinery via puppet) [09:19:33] (03PS1) 10Elukey: Replace analytics1003 with an-coord1001 [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/465362 (https://phabricator.wikimedia.org/T205509) [09:19:57] (03CR) 10Elukey: [V: 032 C: 032] Replace analytics1003 with an-coord1001 [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/465362 (https://phabricator.wikimedia.org/T205509) (owner: 10Elukey) [09:21:24] joal: puppet complete, camus restarting [09:21:42] new coordinator deployed ) [09:21:43] :) [09:22:12] \o/ [09:28:18] elukey: manually killing previsouly started oozie job (MWH-reduced) - Seems stuck [09:28:22] ack [09:29:10] PROBLEM - Age of most recent Analytics meta MySQL database backup files on an-master1002 is CRITICAL: CRITICAL: 0/1 -- /srv/backup/mysql/analytics-meta: No files [09:29:12] elukey: refine job started - Looks like cron is cronning [09:29:23] interesting alert [09:29:30] yeah working on it [09:29:54] I just re-enabled puppet on an-master1002, I wanted to save the last analytics1003's backup [09:32:19] checked all the crons on an1003, and disabled the systemd timers [09:32:27] so we should be good [09:33:19] elukey: looks like oozie has an issue with hive metastore [09:33:24] failed job 
[09:33:37] it tried to connect to an1003 for metastore [09:33:47] elukey: https://hue.wikimedia.org/jobbrowser/jobs/job_1538849321221_8108/single_logs [09:34:40] Hoooo ! elukey : hdfs://user/hive/hive-site.xml !!!! [09:35:27] indeed - incorrect value for hive.metastore.uris [09:36:49] one at the time :) [09:39:15] joal: was the oozie job recent or an old one? [09:40:07] also, can you tell me more about the hive uris setting and where you found it? [09:40:19] otherwise it is difficult to understand where to check :) [09:44:07] elukey: was gone to bathroom sorry [09:44:30] elukey: oozie jobs are configured to read their hive setttings from a hive-site.xml file [09:44:35] this file is to on hadoop [09:44:54] we have it stored here: hdfs://user/hive/hive-site.xml [09:45:08] and it contains old (an1003) values [09:45:20] ah ok now it makes sense [09:46:03] in theory it should have been updated [09:46:41] hm [09:47:33] we have an exec in puppet to uplaod it [09:47:40] but it might not have worked as expected [09:47:44] let's change it now manually [09:47:54] k [09:48:13] joal: are you doing it? 
[09:48:22] I can :) [09:51:35] joal: even if I can simply upload an-coord's one [09:51:36] lemme try [09:52:54] elukey: done [09:53:01] elukey: retrying ooie [09:53:31] done [09:53:33] ahhahah okok [09:53:36] :D [09:53:42] I just did sudo -u hdfs hdfs dfs -put -f /etc/hive/conf.analytics-hadoop/hive-site.xml /user/hive/hive-site.xml [09:54:15] so the code is in profile::hive::site_hdfs [09:54:23] it uploads the new file only if it gets refreshed [09:54:28] so not this cas [09:54:31] *case probably [09:55:27] Oozie problem solved - job started [10:01:23] !log Restart failed oozie jobs (webrequest, virtual-pageviews, mwh-reduced) [10:01:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:02:31] PROBLEM - Check the last execution of check_webrequest_partitions on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions [10:06:51] RECOVERY - Age of most recent Analytics meta MySQL database backup files on an-master1002 is OK: OK: 1/1 -- /srv/backup/mysql/analytics-meta: 0hrs [10:08:06] ok this is good --^ [10:19:49] mmmm check_webrequest_partitions is a bit weird [10:20:06] elukey: could cause cluster is a bit late? [10:20:44] ahhh no I confused all those M _ etc.. [10:20:51] it is only complaining about the last hour [10:20:57] righ [10:21:08] so yeah all proceeding good [10:26:58] setting analytics1003 as spare host to prevent any accidental attempt to come back to live [10:28:39] elukey: no zombies in an-vlan :) [10:29:45] hey team! [10:30:25] o/ [10:32:55] elukey, can I help with alarms? [10:33:25] mforns: all good, we replaced analytics1003 [10:33:36] ok :] [10:34:07] elukey, does it have another name now? [10:36:58] mforns: another host, an-coord1001.eqiad.wmnet :) [10:37:33] elukey, so data drop and refine jobs run there now, right? 
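Editor's aside: the stale value joal found in `hdfs:///user/hive/hive-site.xml` was the metastore URI that oozie-launched jobs read. For reference, the property in question is `hive.metastore.uris`; with the corrected host (port 9083 is the one that appears in the grep output later in the log) the fragment would look roughly like:

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://an-coord1001.eqiad.wmnet:9083</value>
</property>
```

The `hdfs dfs -put -f` shown above simply overwrote the whole file with the puppet-managed copy from an-coord1001, which contains this value.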
[10:38:13] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Krenair) >>! In T153821#4651201, @JAllemandou wrote: >> Why is this not documented on the wiki creation page? > I don't underdstand what... [10:38:29] mforns: yep! [10:38:34] k :] [10:40:07] mforns: just to be sure, can you check that you can access the host via ssh ? [10:40:11] sure [10:41:24] elukey, yes, and I can sudo -u to hdfs user [10:41:33] nice [10:49:05] 10Analytics, 10DC-Ops, 10decommission, 10User-Elukey: Decommission analytics1003 - https://phabricator.wikimedia.org/T206524 (10elukey) p:05Triage>03Normal [10:53:43] 10Analytics-Kanban, 10User-Elukey: Upgrade Analytics infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T192642 (10elukey) 05Open>03Resolved [10:53:48] \o/ [10:54:29] :D [12:20:30] joal: in a scale between 0 and America, how free are your right now? for 2 min in the cave [12:34:31] (03PS1) 10Fdans: [wip] Add change_tag to mediawiki_history sqoop [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465416 [12:37:35] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Q1 2018/19 Analytics procurement - https://phabricator.wikimedia.org/T198694 (10elukey) [12:39:21] Original exception: java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://analytics1003.eqiad.wmnet:10000/default;user=yarn;password=: [12:39:24] ah! 
[12:39:42] In theory this one should work only restarting it [12:43:34] all the jobs failed probably had hive-site.xml cached [12:43:51] better, were reading from the version of hdfs that was still mentioning an1003 [12:47:15] !log re-run apis-wf-2018-10-9-8 [12:47:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:48:54] !log re-run all the failed projectview-hourly-coord and aqs-hourly-coord workflows (restarting them via hue) [12:48:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:55:44] lovely it keeps failing [13:01:21] 10Analytics: Many client side errors on citation data, significant percentages of data lost - https://phabricator.wikimedia.org/T206083 (10bmansurov) @Miriam any updates on this? Did you get a chance to talk with Michele and Tiziano? [13:04:13] the main cause seems to be [13:04:14] Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out (Connection timed out) [13:04:28] so I am pretty sure that they are still using an1003 [13:08:20] Heya fdans - I was in america, now distance 0 :) [13:09:47] joal: [13:09:48] elukey@stat1004:/mnt/hdfs/user/oozie$ grep -rni analytics1003 * [13:09:48] share/lib/lib_20170228165236/spark/hive-site.xml:17: thrift://analytics1003.eqiad.wmnet:9083 [13:09:51] share/lib/lib_20170228165236/spark2.2.1/hive-site.xml:17: thrift://analytics1003.eqiad.wmnet:9083 [13:09:54] share/lib/lib_20170228165236/spark2.3.0/hive-site.xml:17: thrift://analytics1003.eqiad.wmnet:9083 [13:09:57] share/lib/lib_20170228165236/spark2.3.1/hive-site.xml:17: thrift://analytics1003.eqiad.wmnet:9083 [13:10:00] * elukey cries in a corner [13:10:05] wow [13:11:25] no idea if this is used or not [13:11:38] but an1003 is cached in other places because some jobs are still failing [13:11:50] elukey: I think it would be used 
by spark jobs [13:12:00] We're gonna know soon [13:12:09] well we have a ton of failures :D [13:12:10] about other jobs failing, this is weird :( [13:12:19] but those are not spark's right? [13:12:57] Most are not (i don't know about API [13:13:08] I've seen you;ve restarted the failed ones? [13:13:24] yeah, failed again [13:13:41] ok - API is spark [13:13:43] RECOVERY - Check the last execution of check_webrequest_partitions on an-coord1001 is OK: OK: Status of the systemd unit check_webrequest_partitions [13:14:56] !log rerun failed aqs-hourl jobs [13:14:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:15:50] I already did it joal, but they failed.. did you change anything? [13:16:05] How crap - didn't notice there was 2 emails [13:16:09] sorry [13:16:11] hm [13:16:54] Ok understood [13:17:26] elukey: the /user/hive/hive-site.xml file is in use since about more than a year I htink [13:17:52] I am wondering if /usr/local/bin/spark2_oozie_sharelib_install is the issue [13:18:04] Before that, we were using a hive-site.xml file copied on HDFS with refiner [13:18:06] Before that, we were using a hive-site.xml file copied on HDFS with refinery [13:18:15] # If running on an oozie server, we can build and install a spark2 [13:18:18] # sharelib in HDFS so that oozie actions can launch spark2 jobs. [13:18:30] We started to use the new to prevent the exact error we're having now: having to restart everything for a change in hive-site.xml [13:19:02] If you look at the config on failing projectview or AQS, you'll see hive-site is not /user/hive/... [13:19:22] Ok - Restarting the failing jobs with correct config (and sending associated patch) [13:20:54] ahhhh [13:23:06] elukey: interesting ! 
those jobs conf have been updated already [13:23:17] They must not have been restarted since a long time :) [13:23:22] Doing so [13:24:04] (03PS1) 10Joal: Update hive-site.xml path in spark util [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465422 [13:25:52] !log full restart of projectview_hourly [13:25:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:26:05] need go afk for ~1h, but will try to check! [13:26:50] !log Full restart of aqs oozie job [13:26:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:29:15] Ok problem solved for aqs_hourly and projectview_hourly [13:29:20] APIs left [13:30:43] I confirm spark jobs making use of hive-metastore are stuck :S [13:31:06] ottomata: as a good morning I need some help :S [13:32:53] hiiii [13:33:01] hm ok [13:33:02] o/ ! [13:33:06] what's up? [13:33:26] an-coord1001 is live and analtics1003 is dead :) [13:33:35] but there still are some issues left [13:34:22] Namely, oozie sharelibs for spark each have a copy of hive-site.xml referencing metastore being analytics1003 [13:34:26] ottomata: --^ [13:36:17] ohhh [13:36:20] ok [13:36:28] annoying sharelibs [13:36:39] !log fully restart projectview_geo oozier job [13:36:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:37:34] ottomata: Not sure if we need to fully make them again, of if a change of file is enough [13:37:49] looking [13:39:32] hm, spark2.3.1 doesn' thave hive-site [13:39:37] for spark 1 joal? [13:40:05] or, did you remove it? [13:40:09] hm, i see it in spark2.3.0 [13:40:29] It should be in spark2.3.1 per elukey searhc [13:40:41] its not, but it should be, maybe it was removed? [13:40:43] i will put it there [13:41:10] weird ottomata [13:41:41] 10Analytics: Many client side errors on citation data, significant percentages of data lost - https://phabricator.wikimedia.org/T206083 (10Miriam) @bmansurov yes, sorry for the delay. 
We propose to cap the citation text in order to avoid these errors. Would that be ok? Thanks! [13:42:30] joal: i'm also going to remove the spark2.2.1 and spark2.3.0 sharelib dirs [13:42:37] hive-site.xml is now in 2.3.1 [13:42:41] sooo, try a job? [13:42:47] Will try one yes :) [13:43:07] About other libs, we should make sure they're not used anymore before removing? [13:43:21] Actually, it'll be a good way to know wherre they're still in use :) [13:43:50] haha too late! (they are in trash) [13:44:00] but yeah i doubt they are used, since we don't have those spark .debs installed anymore [13:44:02] so they shouldn't be! [13:44:25] ottomata: The apis jobs for instance still uses 2.3.0 :) [13:44:28] Restarting it now [13:44:47] :o [13:45:08] joal that is a little weird, I should add that to upgrade steps for spark 2 somewhere [13:45:35] (03PS2) 10Joal: Correct oozie jobs after move to an-coord1001 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465422 [13:46:26] !log Restarting oozie-api job [13:46:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:49:59] actually ottomata - All our oozie spark job were still on spark2.3.0 :)( [13:50:03] Restarting them now [13:50:09] yikes sorry [13:50:11] i can put it back joal? [13:50:23] this is probably better though^ [13:50:57] I'll do it [13:51:03] it's better this way [13:51:12] ok [13:51:51] ottomata: Still a failure [13:52:03] Now spark is ok with new sharelib, but the job fails [13:53:07] table not found ottomata - I assume it's a related issue :) [13:53:23] ottomata: Have you run that command in oozie about the change of something in sharelib? [13:54:34] joal: no i doubt it'd be needed or do anything, since we didn't make a new sharelib [13:54:49] hm [13:54:51] maybe oozie itself needs a restart? maybe it caches that value? 
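Editor's aside: the remediation being discussed — refresh the stale `hive-site.xml` copies inside the oozie sharelib, tell oozie to reload the sharelib, and bounce the server — can be sketched as below. The sharelib path and spark directory names come from the earlier grep output; the local `hive-site.xml` path is an assumption, and the sketch only echoes the commands. `oozie admin -sharelibupdate` is the standard CLI for refreshing oozie's sharelib cache:

```shell
SHARELIB=/user/oozie/share/lib/lib_20170228165236   # path from the grep earlier in the log

# Overwrite each stale hive-site.xml copy in the sharelib.
for d in spark spark2.2.1 spark2.3.0 spark2.3.1; do
    echo "sudo -u hdfs hdfs dfs -put -f /etc/hive/conf/hive-site.xml $SHARELIB/$d/hive-site.xml"
done

# Refresh oozie's view of the sharelib, then restart the server.
echo "sudo -u oozie oozie admin -sharelibupdate"
echo "sudo service oozie restart"
```

Per the conversation that follows, the oozie restart is the step ottomata suspects actually did the trick.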
[13:54:51] hm [13:54:57] we can do that [13:54:58] seems unlikely [13:55:05] since it is in the job config [13:55:14] joal: is that the same error from before? table not found? [13:55:36] the value is in jobconfig (spark-share-lib) - I think oozie caches the sharelib content though [13:55:54] ottomata: I have not double checked previous error :( [13:55:57] hm [13:56:03] ok let's bounce oozie server, can/should I just do that [13:56:10] please [13:56:10] ? [13:56:19] !log bouncing oozie server on an-coord1001 [13:56:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:56:27] ottomata: I also think that command about sharelib update could be good [13:57:38] will run it [13:57:39] can't hurt [13:57:51] done [13:58:14] Rerunning a job [13:59:55] Success ottomata :) [14:00:19] yes! [14:00:36] ottomata: need to catch the kids, will be back for standup - Still to do: restart oozie-spark jobs with new spark-share-lib - Will do wen I'm back [14:00:37] hhai wonder which did the trick! shoulda tried a controlled experiement first! [14:00:44] :D [14:00:44] i betha oozie restart would ahve been enough [14:00:52] i doubt update sharelib would have done anythign [14:01:05] joal: i can work on that, are there more that need to be committed? [14:01:12] more changes in refinery oozie porperties? [14:01:21] ottomata: git st [14:01:24] oops [14:01:54] (03PS3) 10Joal: Correct oozie jobs after move to an-coord1001 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465422 [14:01:57] ottomata: --^ [14:02:37] ottomata: Don't worry I'll restart the jobs manually with the settings in an hour or so - I have MWH-reduced to maonitor, so I'd arther be on it if ou don't mind [14:03:08] ok... [14:26:13] (03CR) 10Ottomata: [C: 031] "I like it." 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465202 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [14:26:47] 10Analytics, 10Analytics-Kanban, 10MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), 10Patch-For-Review: Improve Dashiki extension messaging - https://phabricator.wikimedia.org/T205644 (10Milimetric) No, this isn't deployed. There are two gerrit changes: 1. https://gerrit.wikimedia.org/r/463309 impl... [14:28:54] mforns: are you working on this? https://phabricator.wikimedia.org/T199693 [14:29:00] it's moved into kanban but with no assignee [14:29:31] milimetric, no! I moved it there yesterday, because I was sharing screen in groskin' meeting [14:29:42] we said that someone should grab it [14:29:57] hm... I have some nits on our process, will bring up today [14:29:59] that's why it's assigned to no one [14:30:07] hehe ok [14:36:54] (03CR) 10Ottomata: "Hm, so I'd love to be able to easily grasp all the parts here without needing a walkthrough. I mostly understand, but since it isn't obvi" (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465206 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [14:46:41] hello!! [14:47:11] ottomata,joal - sorry I forgot to log in here, I tried to remove the spark 2.3.1 lib from the oozie hdfs dir and re-create it [14:47:22] but hive-site.xml was not added for some reason [14:47:36] (then I had to go afk and I forgot to ask sorry) [14:47:46] ottomata: how dod you fix it? Manually copied the file? [14:49:03] yeah [14:49:07] and restarted oozie? [14:49:18] nope I didn't [14:52:20] i did [14:52:23] that seemed to do it [14:52:26] but i'm not certain why [14:53:17] !log Restart clickstream oozie job to pick new spark-lib [14:53:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:53:44] ottomata: ah ack! [14:53:55] so now if I got it correctly we should be good right? 
[14:53:59] nothing exploding [14:54:58] about https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465381/, let's decide where to put it [14:56:19] !log Restart check_denormalize oozie job [14:56:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:57:32] !log restart mediawiki-history denormalize oozie job [14:57:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:58:58] elukey: yeah it looks good now, joal is restarting some jobs to update the sharelib path (we were using an older spark2 version [14:58:58] ) [14:59:34] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Beta Release: Wikistats: support annotations in graphs - https://phabricator.wikimedia.org/T178015 (10Milimetric) [14:59:36] 10Analytics, 10Analytics-Wikistats: Read Dashiki annotations into Wikistats - https://phabricator.wikimedia.org/T194702 (10Milimetric) 05declined>03Open The task has more than just the title, and some of it still needs to get done. [15:00:12] !log restart wikidata-article-placeholder oozie job [15:00:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:01:41] nuria: yoohooo [15:04:07] (03CR) 10Milimetric: [C: 032] Change the label to the last day of the week [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/465152 (https://phabricator.wikimedia.org/T206456) (owner: 10Amire80) [15:06:00] ottomata: on friday we were chatting about this (eventlog1002) [15:06:01] /dev/mapper/eventlog1002--vg-data 870G 717G 110G 87% /srv [15:06:12] oo [15:06:21] yso much? 
oh because we have more events [15:06:22] hm [15:06:30] 10Analytics, 10Analytics-Kanban: Table view of timely results in wikistats 2 should be ordered in time descending - https://phabricator.wikimedia.org/T199693 (10Milimetric) a:03Milimetric [15:06:48] 10Analytics, 10Patch-For-Review: Time dimension carried on url for top metrics - https://phabricator.wikimedia.org/T206479 (10Nuria) a:05fdans>03Nuria [15:08:08] !log restart wikidata-specialentites oozie job [15:08:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:08:13] !log restart wikidata-coeditors oozie job [15:08:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:08:53] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Time dimension carried on url for top metrics - https://phabricator.wikimedia.org/T206479 (10Nuria) [15:10:30] (03CR) 10Ottomata: [C: 031] Update DataFrameToHive for dynamic partitions (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465202 (https://phabricator.wikimedia.org/T164020) (owner: 10Joal) [15:10:35] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Time dimension carried on url for top metrics - https://phabricator.wikimedia.org/T206479 (10Nuria) [15:10:39] !log restart Mediawiki-history-reduced [15:10:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:10:58] 10Analytics: Many client side errors on citation data, significant percentages of data lost - https://phabricator.wikimedia.org/T206083 (10bmansurov) According to [[ https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&from=now-7d&to=now-5m&var-datasource=eqiad%20prometheus%2Fops&var-topic=eventloggin... [15:11:45] (03CR) 10Milimetric: [C: 032] "I have marked the jobs to be rerun since the beginning of time. 
This should happen over the next few hours and the data will automaticall" [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/465152 (https://phabricator.wikimedia.org/T206456) (owner: 10Amire80) [15:13:46] 10Analytics: eventlogging logs taking a huge amount of space on eventlog1002 and stat1005 - https://phabricator.wikimedia.org/T206542 (10elukey) p:05Triage>03High [15:13:53] ottomata: opened --^ to track the issue [15:17:29] hm, interesting. we keep for 30 days. [15:17:33] not really that long... [15:33:33] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10JAllemandou) Right, I get it now :) We discussed withbthe team and our plan is to change how we detect/filter pageviews from a domain pe... [15:36:51] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10bd808) Adding @harej and @srodlund as subscribers as I think they will be interested in the outcome here. [15:44:31] ottomata: are we doing this MEP meeting? [15:45:00] elukey, is it possible for me to test something in Turnilo's config.yaml file? How could I do it? [15:45:44] milimetric: yes [15:52:38] mforns: in theory we could test it live on the host, but if it is a quick thing.. otherwise I can try to set up something in labs [15:53:29] elukey, it would be adding a measure to a given datasource [15:53:49] with a formula [15:54:09] we can quickly try on the fly [15:54:28] ok, let me know when it's good for you, it doesn't neet to be today [15:55:30] mforns: now it is fine, can you give me the change? [15:55:40] elukey, yes, one minute [15:55:59] elukey, should I create a puppet patch? 
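Editor's aside: for the Turnilo change mforns is preparing — adding a derived measure with a formula to one datasource in `config.yaml` — the shape is roughly the fragment below. The datasource, measure, and field names here are purely hypothetical; Turnilo measures take a `formula` written as a Plywood expression over `$main`:

```yaml
dataCubes:
  - name: some_datasource        # hypothetical datasource
    measures:
      - name: error_rate         # hypothetical derived measure
        title: Error rate
        formula: $main.sum($errors) / $main.sum($requests)
```

A measure generally cannot be added in isolation without the rest of the dataCube definition, which is why familiarizing with the existing config first (as mforns decides to do) is the right call.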
[15:56:29] mforns: if you want to make it permanent afterwards yes [15:56:40] 10Analytics, 10Cloud-Services, 10Pageviews-API, 10wikitech.wikimedia.org: wikitech.wikimedia.org missing from pageviews API - https://phabricator.wikimedia.org/T153821 (10Nuria) Clarifying: - wikitech pageviews can now be computed as now wikitech wiki is behind varnish, webrequest table gets all data (... [15:57:31] joal: added note to ticket: https://phabricator.wikimedia.org/T153821 [15:57:43] elukey, actually, I need more time to familiarize with the config syntax, not sure if I can just add one measure to a given datasource or I have to specify everything else for that datasource as well... [15:58:20] elukey, let me ping you tomorrow, and I'll have something ready-ish [15:58:29] ack! [15:58:44] thaaanks :] [15:59:27] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Beta Release: Wikistats: support annotations in graphs - https://phabricator.wikimedia.org/T178015 (10Nuria) @milimetric: i have transferred most dashiki annotations (there were not that many) by hand, is this task still needed? [16:02:14] ottomata: jumping in? https://meet.google.com/dcd-vvqb-dhd [16:02:50] OH OOPS wow how did 5 minutes go by [16:26:32] 10Analytics, 10User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (10AndyRussG) Hi!!! Many apologies for the delay here... I think it makes sense to build the realtime data consumer based on the EventLogging stream. The only drawback would be that initiall...
[16:36:09] really weird [16:36:10] Could not open client transport with JDBC Uri: jdbc:hive2://analytics1003.eqiad.wmnet:10000/default;user=yarn;password=: java.net.ConnectException: Connection timed out (Connection timed out)) [16:36:16] this is from 15mins ago [16:45:50] but it happens only for TestSearchSatisfaction2 and SearchSatisfaction [16:46:46] HMMMM [16:47:00] yeah saw those in refine alert, been meaning to check in after emails/lunch [16:47:05] strange it is only those too.... [16:47:06] hm [16:47:22] I am wondering if those are handled separately by say Erik [16:47:30] mmm ya [16:48:14] are they coming to kafka via EL javascript client or server side? ... although for refine that would not matter [16:52:11] elukey: i deleted those alarms cause i assumed it was the switch w/o noticing they were only for those schemas [16:53:23] yep yep, I noticed analytics1003 in those for a recent alert and I thought it was weird [16:54:47] ottomata: plis ping me when you are looking into it [16:57:03] elukey: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=SearchSatisfaction [16:57:08] elukey: it has no traffic [17:01:53] making lunch... [17:02:02] but yeah, even so, where would it be getting analytics 1003 from??? [17:18:49] ottomata: from a stale dns [17:18:54] naww [17:18:56] ottomata: jdbc connection [17:19:13] its a short lived cron job [17:20:25] ottomata: and we know for sure that it is generated from our cron right? [17:20:37] I know!~ [17:20:38] ottomata: mmmm.. the parameters of the connection it opens might not be so short lived as the cron [17:20:57] refinery source code has analytics1003 as a default param, and it isn't overridden in the properties [17:21:25] i will remove the default... [17:21:28] and manually set it [17:21:33] ah there you go :) [17:22:00] elukey: so i understand, the analytics1003 is no longer used for these jobs anymore, right?
[17:22:04] elukey: is there anywhere in puppet that hive server url is set...? looking [17:23:47] would like to use it from hiera rather than hardcoding in ::job::refine class [17:23:52] looks like not though [17:24:35] nuria: yep, nothing runs on it anymore [17:24:51] oh yes there is [17:24:54] i must have been grepping wrong [17:25:02] profile hive client has it [17:28:26] elukey: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465458/1/modules/profile/manifests/analytics/refinery/job/refine.pp [17:29:47] sure but not analytics1003 no? [17:29:57] ottomata: i still do not understand why it wouldn't fail for all schemas though. [17:30:32] (03PS1) 10Ottomata: Update default value of Refine hive_server_url [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465459 (https://phabricator.wikimedia.org/T205509) [17:30:44] nuria: i don't get that either...maybe it is failing too early [17:31:04] elukey: ? [17:31:19] sorry I didn't get that you were fixing it [17:31:26] going to review in a sec, I am merging another change [17:31:31] k [17:31:34] nuria: i'm going to try to rerun [17:31:37] want to do it with me and see? [17:31:39] to practice? [17:31:48] we can override the CLI opt and set manually when we try [17:31:59] bc? [17:32:06] ottomata: yess [17:32:08] k [17:32:16] ottomata: let me get headset [17:33:19] ottomata: on bc [17:36:06] 10Analytics, 10Operations, 10ops-eqiad: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) @elukey I am in conversation with DELL about the server, getting them the info they need.....nothing has been decided yet but as soon as they tell me what they're sending (should be a... [17:38:45] (03CR) 10Elukey: [C: 031] Update default value of Refine hive_server_url [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465459 (https://phabricator.wikimedia.org/T205509) (owner: 10Ottomata) [17:39:23] ottomata: did you also send a cdh module change too?
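[Editor's sketch, not refinery's real API: the failure mode above — a stale hive server URL hardcoded as a code default, never overridden by the puppet-rendered properties — is a classic config-precedence bug. This hypothetical Python fragment illustrates the precedence chain (code default < properties file < CLI override) and why the dead host leaked through.]

```python
# Illustrative only: names and values mirror the chat, not Refine's actual code.
CODE_DEFAULT = {"hive_server_url": "analytics1003.eqiad.wmnet:10000"}  # stale host

def resolve(props: dict, cli: dict) -> dict:
    # Later dicts win on key collisions: code default < properties file < CLI.
    return {**CODE_DEFAULT, **props, **cli}

# The properties file never set the key and no CLI flag was passed,
# so the stale code default silently wins:
stale = resolve({"database": "event"}, {})
print(stale["hive_server_url"])  # analytics1003.eqiad.wmnet:10000

# The manual rerun in the chat overrides it explicitly on the command line:
fixed = resolve({}, {"hive_server_url": "an-coord1001.eqiad.wmnet:10000"})
print(fixed["hive_server_url"])  # an-coord1001.eqiad.wmnet:10000
```

Removing the code default entirely (as the gerrit change 465459 does) turns this silent fallback into a loud, early failure when the properties forget the key.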
[17:39:41] I can see in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465458/ the submodule updated [17:40:09] but maybe that was my last change [17:44:51] 10Analytics, 10Analytics-Kanban: is-yarn-app-running script should output the running application id - https://phabricator.wikimedia.org/T206555 (10Ottomata) [17:44:57] oops i didn't mean to elukey [17:45:02] fixing [17:45:15] fixing it now [17:45:18] should be ready [17:45:32] oh [17:45:33] oop [17:45:34] s [17:45:49] gr8 danke [17:48:07] ottomata: about Nuria's email for the ReadingDepth refine failure - can you give me some hints about how to check what made the refinement fail? Yarn logs and /var/log/refinery ones do not contain much afaict [17:48:11] (but I could be wrong!) [17:48:37] elukey: looking at that now with andrew on bc [17:48:48] elukey: i will follow up [17:50:48] ottomata: how come the default value in refine-params has affected only a single schema? I don't get it [17:51:36] nuria: ah great! Will wait for the mail update [17:51:49] joal: ya, we do not get it either [17:52:18] Ah ok :) I feel less alone - I'll stay with elukey, waiting for news from the frontline [17:53:07] I am curious to know if the other refinement problem (ReadingDepth) is due to the jvm's direct memory settings or something else [17:53:18] anyhow, dinner time, going offline team! [17:53:19] joal: i don't get that yet either [17:53:32] elukey: come to batcave and discuss! :) [17:53:35] bye elukey - We'll talk again about direct memory :) [17:53:40] ok nm byyeee [17:54:20] ottomata: The direct memory seems related to shuffle - might be interesting to see if no-dynamic-allocation helps [17:54:21] ottomata: I could but given the fact that it is 8 PM in here Marika might kill me :D [17:54:39] elukey: Please don't risk that :) [17:54:44] :D [17:54:47] byyeee [17:58:44] np!
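[Editor's note: the morning's failure was io.netty.util.internal.OutOfDirectMemoryError during a shuffle block fetch, with roughly 1 GiB of direct memory already in use. Netty's off-heap buffers are capped by the JVM's -XX:MaxDirectMemorySize, so the two experiments the chat floats — more direct memory, and joal's no-dynamic-allocation test — can be expressed as standard spark-submit --conf flags. The specific values below are illustrative, not the job's real settings.]

```python
# Assemble spark-submit --conf flags for the two experiments discussed above.
# All three property names are standard Spark/JVM options; the values (2g, 8)
# are placeholders, not tuned for the actual refine job.
confs = {
    # Raise the executors' direct (off-heap) memory cap used by netty buffers:
    "spark.executor.extraJavaOptions": "-XX:MaxDirectMemorySize=2g",
    # Test the hypothesis that dynamic allocation is involved in the shuffle issue:
    "spark.dynamicAllocation.enabled": "false",
    # With dynamic allocation off, pin the executor count explicitly:
    "spark.executor.instances": "8",
}
flags = " ".join(f"--conf {k}={v}" for k, v in sorted(confs.items()))
print(flags)
```

The error's numbers (used: 1006632960, max: 1012924416) are consistent with a cap just under 1 GiB, which is why raising the direct-memory limit is one plausible fix to try alongside disabling dynamic allocation.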
byyee [18:24:25] !log adding Accept header to all varnishkafka generated webrequest logs [18:24:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:30:34] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Ottomata) Just added the `accept` field to the varnishkafka generated webrequest logs. @JAllemandou I haven't done this in a while, I'll ping you in m... [18:31:30] (03CR) 10Ottomata: [C: 032] Update default value of Refine hive_server_url [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465459 (https://phabricator.wikimedia.org/T205509) (owner: 10Ottomata) [18:56:01] (03PS1) 10Ottomata: Make is-yarn-application-running --verbose more informative [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465471 (https://phabricator.wikimedia.org/T206555) [19:04:57] ottomata: I observe weird behaviors in DataFrameToHive [19:05:02] Have a minute? [19:05:33] joal ya [19:05:37] bc? [19:05:56] OMW [19:42:27] ottomata: The alter is actually fired at every run [19:42:39] I think we need to find why and try to prevent :) [19:43:05] I have also found another bug - I don't even understand how it was not failing before [19:47:01] joal: like the regular refine is doing that every time too? [19:47:10] joal bc again? [19:47:11] ottomata: possibly yes ! [19:47:13] sure [19:47:21] joal i don't think it is...i think i would see it [19:47:27] when i run it manually [19:47:28] but maybe not! [20:06:30] 10Analytics-Kanban, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Nuria) @Tbayer, you do not need any special permissions to access any type of data, the datasources that were accessible through these permits have sinc... [20:21:14] ottomata: Interesting finding! 
two spark-sql types are different if they don't have the same metadata (comments for instance) [20:21:23] This is where my repetition comes from [20:24:43] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10mforns) @Tbayer > @mforns Great to hear that Druid already allows ingestion of array types! But just to clar... [20:26:19] OHHHH [20:26:22] interesting [20:26:34] So when we use Seq.diff --> it uses equal [20:26:36] joal why don't they have the same comments? oh because source data is from parquet files instead of table? [20:26:48] correct [20:26:51] hmmm [20:27:03] but the json data doesn't have comments... [20:27:10] Same in json for instance - Except that since refine created the tables, no comments (no problem) [20:27:16] ohhhhHHHHH [20:27:17] right. [20:27:26] huh [20:27:33] Looking for an elegant patch [20:27:43] so the alter is removing the comments every time [20:27:49] joal: couldn't you just let your job create the table? [20:28:11] ottomata: For sure, I could even create the table without comments :) [20:28:31] aye [20:28:47] Now an interesting part is that it only tries to alter the subobject field, not others [20:28:52] it sucks that we are losing the comments here, but maybe the convert to schema is doing the right thing here! [20:28:56] While others have comments too [20:28:57] oh that is interesting [20:29:22] if the incoming schema had comments, we'd want it to keep them on the output schema [20:29:25] I have no clue why [20:29:44] indeed ottomata - I'm gonna make sure this is what happens [20:30:01] Enough for tonight though :) [20:30:10] I'll keep on searching on that tomorrow [20:31:14] great find! [21:14:45] 10Analytics-Kanban, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10Tbayer) >>!
In T178802#4653097, @Nuria wrote: > @Tbayer, you do not need any special permissions to access any type of data, the datasources that were a... [21:45:48] milimetric: yt? [21:46:30] no, out picking up steph, what’s up nuria [21:46:49] milimetric: that's fine, ping me if/when you get back online [22:23:08] regarding the above discussion about analytics1003: does this mean that the entire server is renamed now? [22:23:26] it is used in a lot of other contexts, cf. https://wikitech.wikimedia.org/w/index.php?search=analytics1003&title=Special%3ASearch&go=Go [22:25:18] e.g. groceryheist and i couldn't run hive queries from SWAP today as documented at https://wikitech.wikimedia.org/wiki/SWAP#Querying_data ... worked only after changing analytics1003 to an-coord1001 [23:08:28] (03PS3) 10Nuria: [WIP] Time dimension should be reseted to "1-Month" for top metrics [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/465296 (https://phabricator.wikimedia.org/T206479) [23:17:36] (03PS1) 10Mforns: Refactor EventLoggingToDruid to use whitelists and ConfigHelper [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465532 (https://phabricator.wikimedia.org/T206342) [23:18:19] (03CR) 10Mforns: [C: 04-2] "Still need to figure out how to use properties file with ConfigHelper." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465532 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [23:31:37] HaeB: yes, all references need to be updated as that host no longer exists [23:33:57] ok, thanks for clarifying - this wasn't apparent from the announcement https://lists.wikimedia.org/pipermail/wiki-research-l/2018-October/006477.html (CC elukey neilpquinn ) [23:44:38] sorry HaeB that's a documentation update lag on our part [23:55:42] nuria: hi, back, what's up
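[Editor's sketch of the DataFrameToHive repetition joal diagnoses above, in plain Python rather than Spark's Scala API: when field equality includes metadata (column comments), a Seq.diff-style comparison of the Hive table schema against the comment-less schema read from parquet reports a "change" on every run, so the ALTER TABLE fires every time. The Column class below is a stand-in, not Spark's StructField.]

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    metadata: dict = field(default_factory=dict)  # comments would live here

# Hive table schema (created with a comment) vs schema read from parquet data
# (which carries no comments):
table = [Column("event", "struct", {"comment": "event payload"})]
data = [Column("event", "struct")]

# Naive diff, like Scala's Seq.diff: full equality, metadata included,
# so the comment-less field looks "new" on every single run.
naive = [c.name for c in data if c not in table]
print(naive)  # ['event']

# A metadata-insensitive diff compares only name and type: no spurious change.
known = {(c.name, c.dtype) for c in table}
strict = [c.name for c in data if (c.name, c.dtype) not in known]
print(strict)  # []
```

This also matches the observation that the alter strips the comments each time: the "new" comment-less field definition overwrites the commented one, and the next run diffs unequal again.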