[03:09:51] 10Analytics, 10Pageviews-API, 10Tool-Pageviews: Userviews with redirects generates lots of errors - https://phabricator.wikimedia.org/T191866#4119156 (10kaldari)
[05:03:07] 10Analytics, 10Analytics-Wikistats: Routing code allows invalid routes - https://phabricator.wikimedia.org/T188792#4119230 (10Bucky199191) p:05Unbreak!>03Triage
[05:14:00] 10Analytics, 10Analytics-Wikistats: Routing code allows invalid routes - https://phabricator.wikimedia.org/T188792#4119249 (10JJMC89) p:05Triage>03Unbreak!
[05:27:32] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is OK: OK - scalar(max(max_over_time(kafka_burrow_partition_lag{group=kafka-mirror-main-eqiad_to_jumbo-eqiad,topic!.*change-prop.*} [10m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[05:28:37] finally --!!
[05:28:52] it was the "!" char that upset the nagios check
[05:40:32] 10Analytics, 10Pageviews-API, 10Tool-Pageviews: Userviews with redirects generates lots of errors - https://phabricator.wikimedia.org/T191866#4119279 (10MusikAnimal) Yeah, this is one of the [[ https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Gotchas | Gotchas ]] with the API. In general, the 404...
[05:40:52] 10Analytics, 10Pageviews-API, 10Tool-Pageviews: [Userviews] Treat 404 API errors as zero pageviews - https://phabricator.wikimedia.org/T191866#4119280 (10MusikAnimal)
[06:38:46] 10Analytics, 10User-Elukey: Report updater setting log ownership incorrectly (leading to cronspam) - https://phabricator.wikimedia.org/T191871#4119336 (10elukey)
[06:39:12] 10Analytics, 10User-Elukey: Report updater setting log ownership incorrectly (leading to cronspam) - https://phabricator.wikimedia.org/T191871#4119347 (10elukey)
[06:46:49] 10Analytics, 10Operations, 10Ops-Access-Requests: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4119350 (10Jonas) So what is the status here?
[07:13:18] joal: o/
[07:13:27] when you are online can we check --^
[07:13:57] I think Jonas needs access to statistics-users or researchers
[07:17:05] Hi elukey
[07:17:45] elukey: how do you wish me to help?
[07:19:52] joal: mmm I am reading https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Host_access_granted and in theory he should already have access to stat100[46]
[07:20:01] nevermind, I'll triple check and update the task :)
[07:20:17] elukey: let me know if I can help :)
[07:23:40] 10Analytics, 10Operations, 10Ops-Access-Requests: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4100865 (10elukey) So Jonas (user: jk) is already in analytics-privatedata-users, and as far as I can see access is already granted for notebook1003, s...
[07:43:56] joal: https://github.com/criteo/babar seems interesting, especially for the flame graphs!
[08:11:24] 10Analytics-Kanban, 10Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4119479 (10elukey) A bit of historic context about the why db1108 is not read-only: ``` # History context: there used to be a distinction b...
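
Context for the T191866 rename above ("Treat 404 API errors as zero pageviews"): per the linked Gotchas page, the pageviews API answers HTTP 404 for an article that simply has no recorded views in the requested range, so a client that wants a zero instead of an error has to map that status itself. Below is a minimal Scala sketch of that idea using only the JDK HTTP client; the endpoint shape follows the public per-article pageviews REST API, but the helper and its error handling are illustrative, not code from the Userviews tool.

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    object PageviewsSketch {
      // Fetch per-article pageviews JSON; treat HTTP 404 as "no views recorded",
      // which callers should count as zero, per the API gotcha discussed above.
      def dailyViewsJson(project: String, article: String, start: String, end: String): Option[String] = {
        val url = new URL(
          s"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/" +
            s"$project/all-access/user/$article/daily/$start/$end")
        val conn = url.openConnection().asInstanceOf[HttpURLConnection]
        try {
          conn.getResponseCode match {
            case 200   => Some(Source.fromInputStream(conn.getInputStream, "UTF-8").mkString)
            case 404   => None // no data for this page/range: treat as 0 views
            case other => throw new RuntimeException(s"Unexpected HTTP $other from pageviews API")
          }
        } finally conn.disconnect()
      }
    }
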
[08:47:33] ok so I just merged a change that points the m4-master domain (the one that the el mysql consumers use) directly to db1107 (the master db) rather than going through the proxy
[08:47:56] now I am going to update the docs
[08:57:41] https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration#Mysql_insertion_rate_dropping_to_zero_due_to_db_failures
[08:58:02] 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4119539 (10elukey) Created https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration#Mysql_inserti...
[09:00:44] !log restart eventlogging mysql consumers on eventlog1002 to pick up new DNS changes for m4-master - T188991
[09:00:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:00:46] T188991: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991
[09:12:44] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10User-Elukey: [EL sanitization] Ensure presence of EL YAML whitelist in analytics1003 - https://phabricator.wikimedia.org/T189691#4119574 (10elukey) @mforns sorry I don't find the whitelist, can you add it in here? Moreover, do you need that puppe...
[10:07:08] 10Analytics, 10Operations, 10Ops-Access-Requests: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4119726 (10Jonas) 05Open>03Resolved a:03Jonas Thanks for looking into it!
[10:28:26] * elukey lunch + errand! be back in ~2h
[13:04:48] yoo hoo joal
[13:04:56] Hey ! ottomata :)
[13:04:59] Good morning :)
[13:05:32] good afternoon!
[13:05:58] oook joal, plannniiing
[13:06:12] we need to stop refine jobs before deploy and before yarn shuffle patch
[13:06:18] do we need to stop other oozie jobs too?
[13:06:28] ottomata: I think easiest is actually to stop camus
[13:06:34] ok
[13:06:45] That way, nothing else to stop on the webrequest side
[13:07:12] ok, but we are actually going to do that right after standup, right?
[13:07:44] ottomata: I suggest we stop just before standup, like it'll be drained for after
[13:08:11] ok great
[13:08:26] https://etherpad.wikimedia.org/p/spark2-planning
[13:09:03] hello people, if you deploy refinery I'll restart the namenodes to pick up the hdfs trash thing :)
[13:10:05] oh great, yeah, we'll do that right after standup
[13:14:22] joal the after part for relaunching jobs
[13:14:28] those are nice because we do them one by one
[13:14:37] and that allows us to make sure each one works?
[13:14:59] ottomata: yes, for the ones running on a regular basis
[13:15:16] ottomata: For the ones running not that often, we'll have to wait
[13:15:29] ottomata: Do we find a name for that spark package?
[13:15:57] ottomata: how about refinery-sparklib
[13:17:26] oh right
[13:18:11] joal: i don't like 'spark' in the name, because you aren't making this thing to extend or add 'spark' specific functionality, in the same way that refinery-hive is
[13:18:17] maybe refinery-spark is the name
[13:18:34] remind me why we can't have spark in refinery-core?
[13:18:43] ottomata: I picked that one since most of the libs are for spark-related jobs
[13:18:46] and/or why we don't want to just leave this in refinery-job?
[13:19:17] ottomata: better dependency separation (no scala nor spark as dependency of refinery-core)
[13:19:30] ottomata: But if you prefer we could add them there
[13:19:33] joal but why does that help us? just curious
[13:19:52] i think for now i'd just leave it in job rather than core, but that doesn't mean it doesn't belong somewhere other than job
[13:19:58] job actually makes sense to me for most of these things
[13:20:00] refinery-hive doesn't need to depend on spark stuff
[13:20:01] PageHistoryBuilder
[13:20:07] is job specific code
[13:20:09] not a spark library
[13:20:21] right ottomata
[13:20:52] GraphiteClient, DataFrameToDruid, SparkSQLHiveExtensions, those are kinda generic spark libs
[13:21:08] actually ha Graphite client isn't
[13:21:11] its just scala
[13:21:54] joal it sounds like you just want scala stuff in its own module?
[13:22:43] ottomata: true
[13:22:53] ottomata: I also prefer smaller packages
[13:23:05] you want more modules?
[13:23:17] ottomata: why not !
[13:23:41] we could do a refinery-spark for things like SparkSQLHiveExtensions, etc., and a refinery-wmf or refinery-mediawiki or refinery-wikimedia for mw history (and maybe webrequest too?)
[13:23:41] ottomata: I'm after trying to make things not having to depend on everything else if they don't need to
[13:26:00] aye
[13:26:13] joal: i am not sure we are going to get away with never having scala in refinery-core
[13:26:17] spark sure
[13:26:37] refinery-job has since been the place to put all spark stuff, just so we didn't put it in core
[13:26:43] so you are trying to make an in between
[13:27:11] joal: perhaps, we can have a refinery-spark, and just put very spark things in there, like DataFrameToHive, etc., but not PageHistoryBuilder
[13:27:14] ottomata: let's put scala in refinery-core then :)
[13:27:14] those should probably just stay in job
[13:27:23] nothing depends on refinery-job, right?
[13:27:24] ottomata: ok for me
[13:27:32] ottomata: nope
[13:27:50] the idea with refinery-spark is that it can be not-shaded (as opposed to refinery-job)
[13:28:16] aye, but do we need that?
[13:28:31] I think I'd put scala only things like GraphiteClient and HivePartition in refinery-core
[13:28:43] then spark lib type things like DataFrameToHive in refinery-spark
[13:28:53] and any wmf specific code stays in refinery-job
[13:29:27] things like SubgraphPartitioner could go in refinery-spark
[13:29:27] ottomata: works for me
[13:29:42] LogHelper to core, etc.
[13:29:47] joal: can I help with this?
[13:30:12] ottomata: if you want to do it, feel free to modify my patch - If not, I'll do it now :)
[13:30:22] ottomata: I was also looking at the refine-accumulator patch
[13:30:36] ottomata: This is not valid anymore, since we don't iterate over rows
[13:31:09] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10User-Elukey: [EL sanitization] Ensure presence of EL YAML whitelist in analytics1003 - https://phabricator.wikimedia.org/T189691#4120161 (10mforns) The whitelist is in the other task :] >>! In T189690#4051392, @mforns wrote: > The resulting new...
[13:31:14] maybe we can both work on it :p i can start with refinery-core , you do refinery job, then we'll see what is left in refinery-spark?
[13:31:20] and hopefully we won't conflict? :p
[13:31:22] ottomata: going to merge https://gerrit.wikimedia.org/r/#/c/421891/ and then restart kafka-jumbo1001, ok?
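
To make the "scala-only things can live in refinery-core" point above concrete, here is a rough sketch of what a GraphiteClient-style class needs: only the JDK socket API and the Graphite plaintext protocol, no Spark or Hadoop on the classpath, which is what makes a non-Spark module a reasonable home for it. This is an illustrative sketch with made-up names, not the actual refinery GraphiteClient.

    import java.io.PrintWriter
    import java.net.Socket

    // Sketch only: sends one metric in the Graphite plaintext format
    // "<path> <value> <epoch-seconds>\n" over TCP (default Graphite port 2003).
    class GraphiteClientSketch(host: String, port: Int = 2003) {
      def sendMetric(path: String, value: Double,
                     timestampSeconds: Long = System.currentTimeMillis / 1000): Unit = {
        val socket = new Socket(host, port)
        try {
          val out = new PrintWriter(socket.getOutputStream, true)
          out.println(s"$path $value $timestampSeconds")
        } finally socket.close()
      }
    }
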
[13:31:33] elukey: ya +1,
[13:31:53] joal: i forgot about accumulator
[13:32:13] right because we cast in sql now
[13:32:14] hm ok
[13:32:16] i guess we abandon?
[13:32:24] ottomata: I think so yes
[13:32:56] (03Abandoned) 10Joal: Use an accumulator to count in spark Refine [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/413689 (owner: 10Joal)
[13:35:03] ottomata: I'm moving all the mediawiki-history related code back into job, except for the utilities
[13:35:07] ok with you?
[13:36:21] ok joal ya
[13:36:46] joal: you can also do I think the Refine entry job script if you like
[13:36:49] hmm
[13:36:52] or maybe that can stay in spark
[13:36:54] not sure about that one
[13:37:16] If it's an entry job, let's have it in job?
[13:37:24] k
[13:37:28] ya
[13:37:49] joal, hm just noticed this spark.connectors
[13:37:50] hm
[13:38:15] should DataFrameToHive go there?
[13:38:22] dunno
[13:38:39] TransformFunctions should probably go to job
[13:38:40] ottomata: why not - It's just bizarre to separate it from the other refine-oriented classes
[13:38:42] along with Refine
[13:38:57] joal maybe we will get rid of this spark.refine level now too
[13:39:02] ok
[13:41:20] joal i betcha we will want to eventually move webrequest stuff from core -> job or something
[13:41:23] (not now though)
[13:41:36] kafka-jumbo1001 restarted
[13:42:05] ottomata: hm - maybe?
[13:51:22] ottomata: shall I make a utilities package in refinery-spark handling MWH and Refine utility classes?
[13:51:24] joal: RefineTarget should move with Refine
[13:51:35] Refine is already in job
[13:51:42] aye, i think RefineTarget should go there too
[13:52:08] ottomata: ok
[13:52:27] ottomata: Moving them back into their own package
[13:52:49] k
[13:52:55] job.refine for those 2 sounds fine
[13:53:02] as for utils
[13:53:02] hm
[13:53:08] sure!
[13:53:18] would be nice to think of a better name, but i think for now spark.utils for those is fine
[13:53:19] ottomata: I'll also move EventLoggingSanitization into refine?
[13:53:25] yes
[13:53:33] Ok
[13:53:34] that's fine
[13:53:45] TimestampHelpser hm, that could go in core :p
[13:53:46] And DataFrameToHive+HivePartition in connectors
[13:53:50] huhu
[13:53:52] true
[13:54:00] phpunserializer too
[13:54:07] ya
[13:54:48] joal, should SparkSQLHiveExtensions be
[13:54:50] refinery.spark.
[13:54:51] or
[13:54:54] refinery.spark.sql.
[13:54:54] ?
[13:55:02] if it was in spark upstream, it'd be spark.sql
[13:55:08] yup
[13:55:10] then we could call it
[13:55:15] spark.sql.HiveExtensions :p
[13:55:17] !
[13:55:27] i think that sounds nice
[13:55:34] wow this is a biiiig refactor! hoo boyyy
[13:55:34] ok for spark.sql
[14:00:49] joal do you do this using the IDEA refactor menu?
[14:00:51] it kinda is working
[14:00:53] i guess easier than manual
[14:01:00] but it doesn't totally work, especially with test files moving around too
[14:01:18] ottomata: I'm almost done with a lot
[14:01:29] ottomata: Do you mind waiting for a minute?
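
The spark.sql.HiveExtensions name settled on above points at the usual Scala extension pattern: an implicit class that adds Hive-friendly helpers to Spark SQL types. A hedged sketch of that pattern on StructType follows; the method names are invented for illustration and are not the real SparkSQLHiveExtensions API.

    import org.apache.spark.sql.types.{StructField, StructType}

    object HiveExtensionsSketch {
      // Illustrative implicit "extension" on StructType, the kind of helper a
      // spark.sql.HiveExtensions module would hold (hypothetical method names).
      implicit class StructTypeOps(schema: StructType) {
        // Copy of the schema with all field names lower-cased, matching how Hive
        // treats column names.
        def normalizeFieldNames: StructType =
          StructType(schema.map(f => f.copy(name = f.name.toLowerCase)))

        // Fields present in this schema but missing from `other`, compared by
        // lower-cased name; useful when deciding which columns a table still needs.
        def fieldsMissingFrom(other: StructType): Seq[StructField] = {
          val otherNames = other.fieldNames.map(_.toLowerCase).toSet
          schema.filter(f => !otherNames.contains(f.name.toLowerCase))
        }
      }
    }
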
[14:02:30] no prob
[14:02:37] still need to figure out pom deps
[14:02:39] since those move around a lot too
[14:07:06] ottomata: need to catch the kids - will be back soon with the patch
[14:09:45] k np
[14:20:28] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on einsteinium is CRITICAL: CRITICAL - scalar(sum(avg_over_time(kafka_producer_producer_metrics_record_send_rate{client_id=kafka-mirror-.+-main-eqiad_to_jumbo-eqiad@[0-9]+} [30m]))): 0.0 = 0.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqi
[14:20:58] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on einsteinium is CRITICAL: CRITICAL - scalar(sum(avg_over_time(kafka_consumer_consumer_fetch_manager_metrics_all_topics_records_consumed_rate{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))): 0.0 = 0.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_
[14:22:25] hmmm
[14:25:05] !log bouncing all main -> jumob mirror makers, they look stuck!
[14:25:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:26:03] so a bounce of kafka-jumbo1001 caused this ?
[14:26:10] elukey: is that the only node you've bounced so far?
[14:26:14] yep
[14:26:24] it looks like it, it looks like consumers had to rebalance due to leadership change, and then things got funky
[14:26:27] not good!
[14:26:32] nope :(
[14:26:58] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on einsteinium is OK: OK - scalar(sum(avg_over_time(kafka_consumer_consumer_fetch_manager_metrics_all_topics_records_consumed_rate{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumb
[14:27:37] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on einsteinium is OK: OK - scalar(sum(avg_over_time(kafka_producer_producer_metrics_record_send_rate{client_id=kafka-mirror-.+-main-eqiad_to_jumbo-eqiad@[0-9]+} [30m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[14:27:47] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: CRITICAL - scalar(max(max_over_time(kafka_burrow_partition_lag{group=kafka-mirror-main-eqiad_to_jumbo-eqiad,topic!.*change-prop.*} [10m]))): 746486.0 100000.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[14:43:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is OK: OK - scalar(max(max_over_time(kafka_burrow_partition_lag{group=kafka-mirror-main-eqiad_to_jumbo-eqiad,topic!.*change-prop.*} [10m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[14:45:55] joal: i forgot about refinery-tools
[14:46:02] maybe these things like ReflectUtils, GraphiteClient, etc. should go there?
[14:46:04] ottomata: hm
[14:46:09] ottomata: why not
[14:46:39] compiling reflectutils, or maybe getting scala to compile/run tests in refinery core
[14:46:40] hm
[14:46:48] going to move to refinery-tools first
[14:48:02] sigh i dunno
[14:48:09] we don't actually use refinery tools for anything right now
[14:48:12] we could probably just remove it
[14:48:19] maybe that would be better
[14:48:26] ottomata: that would work for me
[14:48:34] ottomata: I think tools was for guard only
[14:48:36] k, let's do that in another release though
[14:48:37] ok anyway, so core
[14:48:59] refinery/core/ReflectUtils.scala:3: error: object runtime is not a member of package reflect
[14:48:59] [INFO] import scala.reflect.runtime.universe
[14:49:17] yup
[14:49:22] got the same ottomata
[14:49:28] you got the same?
[14:49:51] I have been moving everything discussed to the correct place, and now I have that compilation error as well
[14:50:07] oh but, ReflectUtils is going to core, no?
[14:50:16] I wanted to have the thing working before leaving for the kids but didn't manage
[14:50:17] hm https://stackoverflow.com/questions/25189608/cant-import-scala-reflect-runtime-universe/25191687
[14:50:21] Yup
[14:50:24] ook, maybe that's just changed in scala 2.11
[14:50:31] joal i've been moving things to core
[14:50:39] are we going to conflict? :d
[14:50:45] I think we will :)
[14:51:08] a-team: just sent the e-scrum sorry, will try to join retro (workers at home :( )
[14:51:10] oh boyyy
[14:51:12] ottomata: It's weird that reflection didn't cause problems before
[14:51:28] hm yeah, but maybe spark pulls in reflect or something?
[14:51:29] dunno
[14:51:35] ottomata: probably yeah
[14:51:39] joal anyway, hm if you are moving things to core too, maybe I should just stop my patch and let you do it all?
[14:51:48] i thought you were just doing spark -> job
[14:51:53] ottomata: hopefully I'll be ready soon
[14:53:27] mforns: thanks for the comments in the review! Don't want to be picky, just trying to understand what's changing :(
[14:53:56] elukey, not picky at all! thanks for taking the time to look at the code thoroughly :]
[14:54:07] joal: so I should stop?
[14:54:14] ottomata: please do
[14:54:16] ok
[14:54:20] ottomata: sorry I wasn't clear enough
[14:54:54] (03PS1) 10Mforns: Add job and query for page previews aggregation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425281 (https://phabricator.wikimedia.org/T186728)
[14:55:11] (03PS2) 10Mforns: [WIP] Add job and query for page previews aggregation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425281 (https://phabricator.wikimedia.org/T186728)
[14:59:16] ottomata: Can you please stop camus now ?
[15:00:41] ping fdans ottomata
[15:00:46] sorryyyyy
[15:04:47] ya
[15:04:49] OH STANDUP ahg
[15:06:34] ottomata: Please do it on top of the spark_2 patch :)
[15:11:13] will do
[15:20:30] a-team: I am terribly sorry but the worker-at-home situation has not improved much, plus there seem to be some things that require my attention (sigh). If you don't mind I'd also skip retro :(
[15:30:19] joal: don't need a RefineMonitor patch at all, it just shouldn't be run in YARN
[15:30:29] ?
[15:30:32] it doesn't use any spark stuff
[15:30:32] Ah !
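
A note on the "object runtime is not a member of package reflect" error above: per the linked Stack Overflow answer, scala.reflect.runtime ships in the separate scala-reflect artifact, which the code previously got transitively through Spark; once ReflectUtils moves into a module with no Spark dependency, the build needs an explicit org.scala-lang:scala-reflect dependency for the matching Scala version. The kind of runtime-reflection code that needs it looks roughly like this (a sketch, not the actual ReflectUtils):

    import scala.reflect.runtime.universe._

    object ReflectSketch {
      // Look up a Scala object by fully qualified name and return its singleton
      // instance, roughly what a ReflectUtils-style helper does.
      def getObjectInstance(fqcn: String): Any = {
        val mirror = runtimeMirror(getClass.getClassLoader)
        val module = mirror.staticModule(fqcn)
        mirror.reflectModule(module).instance
      }
    }
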
[15:30:37] Makes sense
[15:30:39] it is all just hadoop filesystem
[15:30:48] i have to use spark submit though
[15:30:54] because RefineTarget does have a spark dep
[15:31:00] but RefineMonitor doesn't use that method
[15:31:04] hm
[15:31:20] ping ottomata
[15:42:47] !log disable puppet on analytics1003 and stop camus crons in preperation for spark 2 upgrade
[15:42:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:43:58] ottomata: let me know afterwards if you want me to stop the rolling restart of kafka-jumbo
[15:50:46] (03PS3) 10Joal: Big refactor of scala and spark code [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/424569 (https://phabricator.wikimedia.org/T188025)
[15:51:00] ottomata: --^ Finally !!!
[15:54:28] (03Abandoned) 10Joal: Refactor refinery-job-spark-2.1 to 2.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405770 (owner: 10Joal)
[16:00:54] (03CR) 10Joal: [C: 032] "Merging before deploy." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/348207 (owner: 10Joal)
[16:03:53] (03CR) 10Joal: [V: 032 C: 032] Upgrade scala to 2.11.7 and Spark to 2.3.0 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/348207 (owner: 10Joal)
[16:04:12] (03CR) 10Joal: [V: 032 C: 032] "Merge for deploy." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425084 (https://phabricator.wikimedia.org/T159962) (owner: 10Joal)
[16:05:18] (03PS1) 10Joal: Revert "Upgrade scala to 2.11.7 and Spark to 2.3.0" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425295
[16:06:01] (03CR) 10Joal: [V: 032 C: 032] "Reverteing for wrong patch-set merged previously" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425295 (owner: 10Joal)
[16:07:41] dcausse: question, could we move the data for this schema into hadoop: https://meta.wikimedia.org/wiki/Schema:SearchSatisfaction?
[16:08:54] nuria_: I'd have to check chelsyx and bearloga, can't remember if the dashboards use mysql or hadoop
[16:09:14] (03PS1) 10Joal: Revert the revert of "Upgrade scala to 2.11.7 and Spark to 2.3.0" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425297
[16:09:27] and I think it's already in hadoop IIRC
[16:09:44] (03CR) 10Joal: [V: 032 C: 032] "Merging this, trying to get ready for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425297 (owner: 10Joal)
[16:10:08] dcausse: yes, all is in hadoop, i just want to make sure it does not go to mysql as it is pretty high volume, let me triple check we did not blacklist that before
[16:12:47] dcausse: triple checked, it is ONLY in hadoop
[16:12:54] dcausse: sorry for the trouble
[16:13:12] nuria_: yes I was looking and found that you already blacklisted it :)
[16:13:20] dcausse: super thanks
[16:13:26] yw!
[16:13:44] (03PS1) 10Joal: Revert "Revert "Upgrade scala to 2.11.7 and Spark to 2.3.0"" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425299
[16:14:10] (03CR) 10Joal: [V: 032 C: 032] "Trying again ... Super sorry for spams" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425299 (owner: 10Joal)
[16:17:08] (03PS1) 10Ottomata: Upgrade scala to 2.11.7 and Spark to 2.3.0 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425302
[16:18:35] (03CR) 10Ottomata: [V: 032 C: 032] Upgrade scala to 2.11.7 and Spark to 2.3.0 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425302 (owner: 10Ottomata)
[16:21:07] (03CR) 10Nuria: [C: 04-1] "Please jonas take a look as tests are failing." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/423336 (https://phabricator.wikimedia.org/T191714) (owner: 10Jonas Kress (WMDE))
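
To illustrate the RefineMonitor point above (the Spark dependency comes in through RefineTarget, not from anything the monitor itself needs), here is a minimal sketch of a check written against the Hadoop FileSystem API alone, with no SparkSession anywhere. The _REFINED flag name and the helper are hypothetical, not the real refinery code.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object RefineMonitorSketch {
      // Return the target directories under basePath that are missing a success flag,
      // using only the Hadoop FileSystem API.
      def targetsMissingFlag(basePath: String, flagName: String = "_REFINED"): Seq[Path] = {
        val fs = FileSystem.get(new Configuration())
        fs.listStatus(new Path(basePath))
          .filter(_.isDirectory)
          .map(_.getPath)
          .filterNot(dir => fs.exists(new Path(dir, flagName)))
          .toSeq
      }
    }
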
[16:24:28] (03PS1) 10Ottomata: Add HiveServer to spark-refine for schema changes [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425306 (https://phabricator.wikimedia.org/T159962)
[16:25:08] (03CR) 10Ottomata: [V: 032 C: 032] Add HiveServer to spark-refine for schema changes [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425306 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata)
[16:25:43] (03Abandoned) 10Joal: Add HiveServer to spark-refine for schema changes [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425084 (https://phabricator.wikimedia.org/T159962) (owner: 10Joal)
[16:27:33] (03PS3) 10Ottomata: Update spark jobs to use hive context [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415812 (https://phabricator.wikimedia.org/T159962) (owner: 10Joal)
[16:28:26] (03CR) 10Ottomata: [V: 032 C: 032] Update spark jobs to use hive context [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415812 (https://phabricator.wikimedia.org/T159962) (owner: 10Joal)
[16:29:19] (03PS15) 10Joal: Add by-wiki stats to MediawikiHistory job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507)
[16:29:35] (03CR) 10Joal: [V: 032] Add by-wiki stats to MediawikiHistory job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507) (owner: 10Joal)
[16:30:47] (03PS16) 10Joal: Update mediawiki-history spark job for performance [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/419516 (https://phabricator.wikimedia.org/T189449)
[16:31:05] (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/419516 (https://phabricator.wikimedia.org/T189449) (owner: 10Joal)
[16:31:54] mforns: While we're at it - Do you want us to merge https://gerrit.wikimedia.org/r/#/c/420795/
[16:31:57] ?
[16:35:09] (03PS4) 10Joal: Big refactor of scala and spark code [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/424569 (https://phabricator.wikimedia.org/T188025)
[16:35:29] (03CR) 10Joal: [V: 032 C: 032] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/424569 (https://phabricator.wikimedia.org/T188025) (owner: 10Joal)
[16:41:32] (03PS1) 10Joal: Fix pom error due to merge-rebase [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425307
[16:42:07] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425307 (owner: 10Joal)
[16:45:07] mforns: ping for patch before we move?
[16:45:56] (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425003 (owner: 10Joal)
[16:47:28] (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/415324 (owner: 10Joal)
[16:49:10] (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/424253 (https://phabricator.wikimedia.org/T190058) (owner: 10Joal)
[16:52:48] (03PS1) 10Joal: Update jobs using Spark to new 0.0.60 jar [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425310
[16:56:02] (03PS1) 10Joal: Fix wrong package name after refactor [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425311
[16:56:03] yw!
[16:56:07] oops
[16:56:27] (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425311 (owner: 10Joal)
[16:58:56] joal, ottomata : are you on batcave?
[16:59:38] (03PS1) 10Joal: Move HivePartition to core instead of spark [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425314
[17:00:45] nuria_: yes we are
[17:04:02] (03PS2) 10Joal: Move HivePartition to core instead of spark [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425314
[17:04:19] (03CR) 10Joal: [V: 032 C: 032] "Merge before deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425314 (owner: 10Joal)
[17:04:59] (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425310 (owner: 10Joal)
[17:06:10] (03CR) 10Nuria: "Looks good" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425281 (https://phabricator.wikimedia.org/T186728) (owner: 10Mforns)
[17:19:34] 10Quarry: Queries hang when results have duplicate column names - https://phabricator.wikimedia.org/T191904#4120880 (10MMiller_WMF)
[17:20:47] (03PS1) 10Joal: Update changelog.md to deploy v0.0.60 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425320
[17:24:27] 10Quarry: Queries hang when results have duplicate column names - https://phabricator.wikimedia.org/T191904#4120903 (10zhuyifei1999)
[17:24:30] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464#4120906 (10zhuyifei1999)
[17:25:28] joal: is it worth adding the -skipTrash thing to the changelog ?
[17:25:36] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464#4120909 (10MMiller_WMF) Thanks @zhuyifei1999 -- I hadn't noticed that I was making a duplicate.
[17:27:05] (03CR) 10Ottomata: [C: 031] Update changelog.md to deploy v0.0.60 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425320 (owner: 10Joal)
[17:27:42] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464#4120916 (10zhuyifei1999) It's okay. If we could just somehow fix this...
[17:27:51] (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425320 (owner: 10Joal)
[17:28:26] (03CR) 10Elukey: "Extra nit: spaces after words :)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425320 (owner: 10Joal)
[17:28:39] all right I guess not :)
[17:29:00] elukey: that's not in refinery, right?
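
Context for the "Move HivePartition to core instead of spark" patch above: a partition abstraction is mostly string and path bookkeeping, so it compiles with no Spark on the classpath, which is what makes refinery-core a reasonable home for it. The class below is a rough illustrative sketch, not the actual refinery HivePartition.

    import scala.collection.immutable.ListMap

    // Sketch only: partition keys are kept in insertion order because that order
    // determines the HDFS directory layout.
    case class PartitionSketch(database: String, table: String, partitions: ListMap[String, String]) {
      // Relative HDFS path, e.g. year=2018/month=4/day=9
      def relativePath: String = partitions.map { case (k, v) => s"$k=$v" }.mkString("/")

      // HiveQL partition spec, e.g. `year`='2018',`month`='4' (values quoted as strings here)
      def hiveSpec: String = partitions.map { case (k, v) => s"`$k`='$v'" }.mkString(",")

      // Fully qualified table name
      def tableName: String = s"`$database`.`$table`"
    }
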
[17:29:40] ottomata: ah not in source yes
[17:46:45] !log Refinery source 0.0.60 deployed to archiva
[17:46:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:51:56] joal, sorry missed your ping
[17:52:11] yes please, if you think it's mergeable
[17:52:18] I tested that with real data
[17:52:43] oh... already deployed
[17:53:13] no problem, it will take a couple of days until it's operational, I'm still finishing the mysql one with Luca
[17:53:17] next week
[17:57:10] (03CR) 10Mforns: [WIP] Add job and query for page previews aggregation (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425281 (https://phabricator.wikimedia.org/T186728) (owner: 10Mforns)
[18:05:58] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, and 2 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4121062 (10Pchelolo)
[18:06:00] !log EDeploy refinery to HDFS
[18:06:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:06:02] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (done): Add support for catch-all rule in ChangeProp - https://phabricator.wikimedia.org/T191238#4121059 (10Pchelolo) 05Open>03Resolved The feature was merged and released for the JobQueue, the related config changes for ChangeProp was also merg...
[18:17:59] a-team: thanks a lot for your patience today, I managed to avoid getting my house into a complete mess, now things are better :)
[18:18:09] :D
[18:18:21] I am going to restart the namenodes tomorrow to pick up the hdfs trash change
[18:18:54] !log restarting all hadoop nodemanagers, 3 at a time to pick up spark2-yarn-shuffle.jar T159962
[18:18:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:18:57] T159962: Spark 2 as cluster default (working with oozie) - https://phabricator.wikimedia.org/T159962
[18:19:56] ottomata: if you restart the master nodes - be aware that there is a change pending, namely moving the prometheus jmx config under /etc/prometheus
[18:20:03] it has been applied to all the workers
[18:20:17] but not on the masters (I was waiting for a chance to restart them)
[18:20:33] I am not foreseeing any issue, buuut better to be aware :)
[18:20:39] ok cool elukey no plans to restart masters
[18:20:51] but good to know
[18:20:52] thanks
[18:22:38] ottomata: last thing - shall I keep going with jumbo restarts or not tomorrow?
[18:22:45] (keeping an eye on mm of course)
[18:26:14] anyhow, going off team! byyyee
[18:26:50] Bye elukey :)
[18:28:28] elukey: if you don't mind restarting all mm main -> jumbos if they freeze
[18:28:30] it should be fine
[19:15:42] ottomata: since yesterday, I can't log in to 1005 and 1006. any reason you can guess or is this on my end?
[19:16:57] leila: bastion changed
[19:17:08] leila: so probably you need new machine on ssh confoig
[19:17:11] *config
[19:17:28] got you.
[19:17:54] nuria_: did I miss instructions? (I am trying hard to stay on top of emails.)
[19:18:20] leila: they would not come from us but rather from cloud/ops
[19:18:33] leila: let me see if i find it
[19:19:09] leila: i think bastion-eqiad.wmflabs.org
[19:19:27] aha. got you. thanks, nuria_.
[19:19:31] * leila goes to fix config file.
[19:21:55] leila: let me know if it does not work
[19:22:04] ok. thanks, nuria_.
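
On the spark2-yarn-shuffle.jar NodeManager restarts logged above: that jar provides the external shuffle service on the YARN side (registered as a NodeManager auxiliary service), and jobs opt into it from the Spark side. A hedged sketch of the matching Spark settings follows; the property keys are standard Spark 2 configuration, while the values and application name are purely illustrative.

    import org.apache.spark.sql.SparkSession

    // Sketch: Spark-side settings that pair with an external (YARN aux) shuffle service.
    val spark = SparkSession.builder()
      .appName("shuffle-service-example")                  // illustrative name
      .config("spark.shuffle.service.enabled", "true")     // use the NodeManager shuffle service
      .config("spark.dynamicAllocation.enabled", "true")   // dynamic allocation requires the service above
      .config("spark.dynamicAllocation.maxExecutors", "64")
      .enableHiveSupport()
      .getOrCreate()
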
[19:26:51] !log restarted camus-webrequest and camus-mediawiki (avro) camus jobs
[19:26:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:05:05] (03PS1) 10Ottomata: Refine - Don't call sys.exit if running in YARN [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425347 (https://phabricator.wikimedia.org/T159962)
[20:07:58] (03PS2) 10Ottomata: Refine - Don't call sys.exit if running in YARN [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425347 (https://phabricator.wikimedia.org/T159962)
[20:11:04] (03PS3) 10Ottomata: Refine - Don't call sys.exit if running in YARN [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425347 (https://phabricator.wikimedia.org/T159962)
[20:11:09] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on einsteinium is CRITICAL: CRITICAL - scalar(sum(avg_over_time(kafka_producer_producer_metrics_record_send_rate{client_id=kafka-mirror-.+-main-eqiad_to_jumbo-eqiad@[0-9]+} [30m]))): 0.0 = 0.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqi
[20:11:29] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on einsteinium is CRITICAL: CRITICAL - scalar(sum(avg_over_time(kafka_consumer_consumer_fetch_manager_metrics_all_topics_records_consumed_rate{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))): 0.0 = 0.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_
[20:11:32] gahhh
[20:14:33] !log restart mirrormakers main -> jumbo (AGAIN)
[20:14:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:17:19] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on einsteinium is OK: OK - scalar(sum(avg_over_time(kafka_producer_producer_metrics_record_send_rate{client_id=kafka-mirror-.+-main-eqiad_to_jumbo-eqiad@[0-9]+} [30m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[20:17:29] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on einsteinium is OK: OK - scalar(sum(avg_over_time(kafka_consumer_consumer_fetch_manager_metrics_all_topics_records_consumed_rate{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumb
[20:18:29] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: CRITICAL - scalar(max(max_over_time(kafka_burrow_partition_lag{group=kafka-mirror-main-eqiad_to_jumbo-eqiad,topic!.*change-prop.*} [10m]))): 997251.0 100000.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[20:24:39] 10Analytics, 10Analytics-Wikistats, 10Accessibility, 10Easy, 10Patch-For-Review: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#4121545 (10MCornacchio) @Volker_E I have a question about this issue: " heading must not be empty" | Which heading(...
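
On the "Refine - Don't call sys.exit if running in YARN" patch above: in yarn cluster mode the driver runs inside the ApplicationMaster, where a bare System.exit can cut short YARN's own shutdown and mis-report the application status, so the exit is better guarded by deploy mode. The sketch below shows the general idea and is not the actual refinery patch.

    import org.apache.spark.sql.SparkSession

    object ExitSketch {
      // Only call sys.exit when running in client/local mode; in cluster mode,
      // signal failure by throwing so YARN records the right final status.
      def finish(spark: SparkSession, success: Boolean): Unit = {
        val deployMode = spark.conf.get("spark.submit.deployMode", "client")
        spark.stop()
        if (deployMode != "cluster") {
          sys.exit(if (success) 0 else 1)
        } else if (!success) {
          throw new RuntimeException("Job failed")
        }
      }
    }
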
[20:38:22] !log restarted event* camus and refine cron jobs, puppet is reenabled on analytics1003
[20:38:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:38:34] joal: i merged the patch with the refine and banner job stuff
[20:38:36] all camus are back on
[20:38:49] refine jobs are back on, using 0.0.61 jar on analytics1003 in my homedir
[20:39:03] (03CR) 10Ottomata: [C: 032] Refine - Don't call sys.exit if running in YARN [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425347 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata)
[20:39:11] just merged ^
[20:43:07] !log bouncing main -> jumbo mirrormakers to blacklist job topics until we have time to investigate more
[20:43:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:04:47] ottomata, joal: batcave again or we are done for today?
[21:05:22] nuria_: In da cave now
[21:05:45] be there in a min
[21:47:05] 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, 10New-Readers, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4121741 (10DFoy) @BBlack - not sure why OperaMini proxy IPs are no longer being exported. Can this information be re-established? My only...
[22:19:08] (03PS1) 10Joal: Fix LZ4 version issue with maven exclusion [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425435
[22:20:41] (03CR) 10Jonas Kress (WMDE): "Thanks, but that is on purpose." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/423336 (https://phabricator.wikimedia.org/T191714) (owner: 10Jonas Kress (WMDE))
[22:21:03] (03PS2) 10Joal: Fix LZ4 version issue with maven exclusion [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425435
[22:22:02] (03CR) 10Joal: [V: 032 C: 032] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/425435 (owner: 10Joal)
[22:26:28] (03CR) 10Nuria: [V: 04-1 C: 04-1] "Jonas, intent is good. Now, you want to send our way a bug all that is needed is a ticket, we shall look at it." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/423336 (https://phabricator.wikimedia.org/T191714) (owner: 10Jonas Kress (WMDE))
[22:32:41] (03PS1) 10Joal: Update spark jobs jar and correct assembly path [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425441
[22:37:25] (03PS2) 10Joal: Update spark jobs jar and correct assembly path [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425441
[22:38:55] (03CR) 10Nuria: [C: 032] Update spark jobs jar and correct assembly path [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425441 (owner: 10Joal)
[22:41:50] (03CR) 10Joal: [V: 032] Update spark jobs jar and correct assembly path [analytics/refinery] - 10https://gerrit.wikimedia.org/r/425441 (owner: 10Joal)
[22:42:35] !log Refinery-source 0.0.61 deployed on archiva
[22:42:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:43:10] !log Deploying refinery with scap
[22:43:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log