[00:06:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:27:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:27:59] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:30:01] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:30:41] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:43:05] <icinga-wm>	 PROBLEM - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:51:55] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:09:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:18:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:30:53] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:51:58] <elukey>	 good morning
[05:56:35] <elukey>	 another big query caused the errors on 1130
[05:56:36] <elukey>	 https://yarn.wikimedia.org/jobhistory/attempts/job_1619507802557_51897/m/KILLED
[05:56:43] <elukey>	 checking the others
[06:02:07] <elukey>	 another one seems to be application_1619507802557_48904 but I don't find it in yarn
[06:02:50] <elukey>	 but this time user analytics
[06:19:28] <elukey>	 I have a theory on this, not sure what is causing it though
[06:19:44] <elukey>	 let's pick an-worker1130
[06:20:26] <elukey>	 elukey@an-worker1130:~$ sudo systemctl status hadoop-yarn-nodemanager.service  | grep Task Tasks: 618 (limit: 11059)
[06:21:03] <elukey>	 and the processes running as yarn are currently 5
[06:21:49] <elukey>	 the "tasks" are counting, IIUC, both userland threads and also kernel threads
[06:22:21] <elukey>	 the limit is set using some heuristic depending on kernel + hw, but in our case it is probably too low
[06:22:40] <elukey>	 as soon as the containers start to increase, the thread count rises and we hit a ceiling
[06:22:43] <elukey>	 (this is my theory)
[06:23:00] <elukey>	 so we could raise the limit, it should be very safe
[06:23:11] <elukey>	 the defaults from the kernel are probably not ok for our use case
[06:23:30] <elukey>	 (maybe this is a side effect of the capacity scheduler? different container allocation etc..)
[06:39:48] <elukey>	 so we could apply a systemd override to control the TasksMax parameter
[07:46:34] <joal>	 wow - cool analysis elukey!
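The Tasks check described above can be sketched in Python. This is a minimal illustration, not anything run in the channel: `parse_tasks_line` and `tasks_headroom` are hypothetical helper names, and the only input assumed is a `Tasks: N (limit: M)` line like the one pasted from `systemctl status`.

```python
import re
from typing import Tuple

# Hypothetical helper: matches the "Tasks:" line printed by `systemctl status`.
TASKS_RE = re.compile(r"Tasks:\s*(\d+)\s*\(limit:\s*(\d+)\)")


def parse_tasks_line(line: str) -> Tuple[int, int]:
    """Return (current_tasks, limit) from a line such as 'Tasks: 618 (limit: 11059)'."""
    match = TASKS_RE.search(line)
    if match is None:
        raise ValueError(f"no 'Tasks: N (limit: M)' pattern in: {line!r}")
    return int(match.group(1)), int(match.group(2))


def tasks_headroom(line: str) -> int:
    """Number of additional tasks (threads) the unit can create before hitting the limit."""
    current, limit = parse_tasks_line(line)
    return limit - current


if __name__ == "__main__":
    sample = "Tasks: 618 (limit: 11059)"  # value pasted in the log above
    current, limit = parse_tasks_line(sample)
    print(f"{current} of {limit} tasks in use, {tasks_headroom(sample)} left")
```

Sampling this number while containers ramp up would be one way to check whether the thread count really approaches the limit before a NodeManager dies.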
[07:49:13] <tanny411>	 elukey, joal: Hi, I am not able to connect to scala spark kernel in jupyter lab. It shows connecting for a while and then disconnected. Thanks to joal, I made sure to kinit. Python3 kernel works fine though.
[07:49:13] <tanny411>	 joal: To use arq in terminal do i have to install arq, scala separately?
[07:49:34] <joal>	 Hi tanny411
[07:49:59] <elukey>	 joal: bonjour :) started https://gerrit.wikimedia.org/r/c/operations/puppet/+/685314 
[07:50:02] <joal>	 elukey: we checked yesterday and it seems tanny411 cannot look at her kernel logs - could you help with that?
[07:50:12] <elukey>	 of course
[07:50:16] <joal>	 Good morning elukey :)
[07:50:24] <elukey>	 tanny411: on what node?
[07:50:40] <elukey>	 Hi also :)
[07:50:50] <tanny411>	 stat1008
[07:51:15] <joal>	 tanny411: for you to be able to use arq on the repl, the jars need to be on the classpath so you can import the needed classes
[07:51:29] <elukey>	 tanny411: also how are you connecting to stat1008?
[07:52:54] <joal>	 tanny411: to add the jar to the running kernel I know 2 ways: The %AddJar magic of toree (see https://github.com/apache/incubator-toree/blob/master/etc/examples/notebooks/magic-tutorial.ipynb)
[07:53:46] <joal>	 tanny411: Or, adding your built jar to your notebook kernel by adding "--jars PATH_OF_JAR" in the spark_opts list
[07:53:54] <joal>	 this last version is my preferred one :)
[07:54:23] <tanny411>	 joal: so that actually needs the project to be built properly. getting errors on that.
[07:54:23] <tanny411>	 elukey: I ssh into it. ssh stat1008.eqiad.wmnet -L 8880:127.0.0.1:8880 to be exact
[07:54:33] <tanny411>	 joal: oh okay
[07:55:05] <joal>	 ok tanny411 - let's try to make that work
[07:55:29] <joal>	 tanny411: I assume the last version of my patch (I corrected some stuff yesterday, it passes jenkins now)
[07:56:52] <elukey>	 joal: one thing at a time, otherwise I don't understand :)
[07:57:31] <elukey>	 so tanny411 can see the jupyterhub UI login page and use it, but cannot create a spark notebook
[07:57:33] <joal>	 tanny411: The error I see from jenkins on your patch was due to problems in my code - I fixed that
[07:57:40] <joal>	 sure elukey - stopping the second thread
[07:58:03] <elukey>	 what is tanny411's username on stat1008?
[07:58:14] <tanny411>	 elukey: akhatun
[07:58:31] <elukey>	 ack perfect, checking in the logs
[07:59:29] <elukey>	 I see stuff like
[07:59:30] <elukey>	 /srv/home/akhatun/.local/share/jupyter/kernels/scala_spark_scala/bin/run.sh: line 45: /usr/local/spark/bin/spark-submit: No such file or directory
[08:00:29] <elukey>	 we have /usr/bin/spark2-submit on the host
[08:00:36] <joal>	 for that error --^ I think you have not configured the scala-spark home correctly tanny411 
[08:01:07] <joal>	 reading the docs
[08:01:32] <joal>	 As expected, the problem comes from me forgetting a line in the docs
[08:01:36] <joal>	 I'm sorry for that
[08:01:41] <joal>	 tanny411, elukey --^
[08:01:46] <joal>	 Correcting :S
[08:02:24] <elukey>	 always Joseph's fault
[08:02:26] <elukey>	 :D :D :D
[08:02:40] <joal>	 indeed elukey - That's why I ask for help!
[08:04:01] <joal>	 ok, corrected - I'm sorry tanny411 - With the --spark_home="/usr/lib/spark2/" parameter at kernel creation it should work
[08:04:15] <joal>	 docs corrected here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Scala-Spark_or_Spark-SQL_using_Toree
[08:04:55] <tanny411>	 Great!
[08:05:33] <elukey>	 supe
[08:05:35] <elukey>	 *super
[08:10:40] <joal>	 thanks elukey :)
[08:10:56] <joal>	 gone back to kids
[08:23:03] <tanny411>	 elukey, joal : sorry to bother again, but the kernel still seems to disconnect. Started a fresh ssh connection just to be sure. 
[08:23:44] <elukey>	 tanny411: different error this time!
[08:23:45] <elukey>	 May 05 08:22:06 stat1008 bash[21411]: Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/srv/home/akhatun/Notebooks/spark.sql.shuffle.partitions=256
[08:23:49] <elukey>	 May 05 08:22:06 stat1008 bash[21411]:         at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
[08:24:40] <elukey>	 ah I found a missing --conf
[08:25:14] <elukey>	 tanny411: I have updated the cmd on the wiki, can you retry?
[08:25:23] <tanny411>	 Altight
[08:25:27] <tanny411>	 alright*
[08:27:20] <tanny411>	 elukey: Worked! Thanks a lot! 
[08:27:26] <elukey>	 gooood!! :)
[08:27:31] <elukey>	 thank you for the patience :)
[08:27:47] <tanny411>	 :D
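The "Cannot load main class from JAR" error above fits a bare `key=value` setting being passed to spark-submit without a preceding `--conf`, so that it is taken as the application JAR. A minimal sketch of building the option list so every setting carries its `--conf` flag follows; `build_spark_opts` is a hypothetical helper and is not how the wiki command or the kernel install script is actually written.

```python
from typing import Dict, List, Optional


def build_spark_opts(settings: Dict[str, str], jars: Optional[List[str]] = None) -> List[str]:
    """Build spark-submit style options (hypothetical helper).

    Every Spark setting is emitted as a '--conf key=value' pair; a bare
    'key=value' argument would be read as a positional argument instead.
    """
    opts: List[str] = []
    if jars:
        # --jars takes a single comma-separated list of paths
        opts += ["--jars", ",".join(jars)]
    for key, value in settings.items():
        opts += ["--conf", f"{key}={value}"]
    return opts


if __name__ == "__main__":
    print(" ".join(build_spark_opts({"spark.sql.shuffle.partitions": "256"})))
```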
[08:54:32] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (elukey) @srodlund draft ready! I shared the gdoc with you and the Analytics team :)
[08:55:33] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (elukey) Almost forgot - the procedure should also include T231067#6863800 :)
[08:59:51] <wikibugs>	 Analytics-Clusters: Could not find class ::profile::swap for an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T281917 (elukey) @razzi each check has its own interval, check_puppet_run_changes might run every X hours so it may be slow to update. If you want to get fresh results you can force...
[10:47:19] <elukey>	 joal, razzi, ottomata - we just got a page in SRE due to a heavy job saturating network pipes, namely GPU training :(
[10:47:36] <elukey>	 I killed the job and alerted Miriam/Aiko, but if it re-happens the job is easy to spot
[10:48:55] * elukey lunch!
[11:37:39] <hnowlan>	 Given that the dual loading is working okay now, I might truncate the tables and take the snapshot for the migration to the 3.11 cluster this afternoon
[11:47:12] <joal>	 Hi hnowlan - please give me some time before starting, we got alerts for some failing jobs I'd like to investigate first
[11:47:43] <hnowlan>	 joal: ack, I'll hold! thanks for that
[11:53:05] <wikibugs>	 (PS1) Joal: Correct referrer_daily job's SLA [analytics/refinery] - https://gerrit.wikimedia.org/r/685414
[11:55:04] <joal>	 Thanks a lot elukey for the correction of the spark config :S
[12:02:57] <joal>	 !log rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-5-4
[12:03:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:04:32] <joal>	 elukey: I have a fun feeling about the nodemanager issue we're experiencing and capacity-scheduler strange behavior on used CPUs
[12:22:27] <icinga-wm>	 RECOVERY - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:22:52] <joal>	 !log Reset  monitor_refine_eventlogging_legacy after manual rerun of failed job
[12:22:54] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:39:04] <joal>	 hnowlan: I have 2 jobs failing with weird memory errors - Could it be that keyspace/table configuration could be different for them?
[12:39:55] <joal>	 keyspaces are: "local_group_default_T_top_percountry" and "local_group_default_T_unique_devices"
[12:43:49] <hnowlan>	 joal: hmm, they shouldn't be much different to the other keyspaces :( all keyspaces had a change to their caching key because the syntax is different in cassandra 3 but it shouldn't affect insertions and definitely shouldn't be different per table (these are the diffs between versions https://gerrit.wikimedia.org/r/c/analytics/aqs/+/682934)
[12:44:04] <hnowlan>	 What do the memory errors look like? I doubt I'll be much help but I'm curious
[12:46:16] <elukey>	 joal: o/
[12:46:24] <elukey>	 what kind of feeling do you have for capacity?
[12:46:26] <joal>	 hnowlan: My understanding is that the driver creates too many io.netty.util.HashedWheelTimer in the failing cases, while in the non-failing cases it doesn't even report on creating some
[12:46:34] <joal>	 hi elukey 
[12:46:51] <joal>	 elukey: capacity UI currently doesn't report correctly on the number of CPU used per resource
[12:47:31] <elukey>	 joal: IIRC it doesn't take cpus into account at all with the basic settings, only memory
[12:47:53] <joal>	 elukey: Example - https://yarn.wikimedia.org/cluster/scheduler --> application_1619507802557_58723
[12:48:12] <joal>	 This reports 33 containers (correct) with 33 VCPUs
[12:48:19] <elukey>	 yes yes
[12:48:50] <joal>	 https://yarn.wikimedia.org/proxy/application_1619507802557_58723/ shows that we're using 128 tasks, meaning that each worker-container uses 4 VCPUs, not one
[12:49:00] <elukey>	 joal: so yarn.scheduler.capacity.resource-calculator says
[12:49:05] <elukey>	 The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses Dominant-resource to compare multi-dimensional resources such as Memory, CPU etc. A Java ResourceCalculator class name is expected. 
[12:49:29] <joal>	 elukey: FairScheduler was using RAM-only resource allocation, and was reporting correctly on used CPUs
[12:49:37] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Migrate eventlog1002 to buster - https://phabricator.wikimedia.org/T278137 (Ottomata) +1
[12:49:46] <joal>	 elukey: That's why I'm asking
[12:50:12] <elukey>	 joal: no idea
[12:50:16] <joal>	 Could it be that the node manager allocates for 1CPU, and 4 are used, and therefore limits of containers are not prepared correctly?
[12:50:21] <joal>	 elukey: --^
[12:50:22] <joal>	 ?
[12:50:33] * joal is having ideas way over his head
[12:51:11] <elukey>	 joal: not sure, we should probably check in more depth
[12:51:24] <joal>	 right elukey 
[12:51:26] <elukey>	 the band aid of having more threads allowed should give us more time to investigate
[12:51:28] <joal>	 will do that later
[12:51:33] <wikibugs>	 Analytics-Clusters: Could not find class ::profile::swap for an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T281917 (Ottomata) Hey sorry yall!  I thought I had done a code search and removed all occurrences...must not have noticed this on an-test-client somehow.  Thank you.
[12:51:34] <joal>	 ack
[12:52:07] <mforns>	 heya teammm
[12:53:07] <wikibugs>	 Analytics, Analytics-EventLogging, dev-images, Patch-For-Review: EventLogging dev image should have verbose output enabled - https://phabricator.wikimedia.org/T257378 (hashar) It seems to be solely for the #analytics team.
[12:53:08] <elukey>	 joal: but I'd be curious to see if it is the resource-calculator implementation, maybe the fair one handles vcores assigned correctly even if memory is the only thing that matters, and the capacity doesn't
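The per-container CPU estimate from the discussion above (33 containers reported with 33 VCPUs, but 128 tasks running) can be written out as a small calculation. This is a sketch under two assumptions that are not stated in the log: that one of the containers hosts the application master or driver rather than an executor, and that each running task occupies one core (Spark's default `spark.task.cpus=1`). `cores_per_executor` is a hypothetical helper name.

```python
def cores_per_executor(total_containers: int, concurrent_tasks: int,
                       driver_containers: int = 1) -> float:
    """Estimate cores used by each executor container.

    Assumes `driver_containers` of the containers run no tasks and that
    every concurrent task uses exactly one core.
    """
    executors = total_containers - driver_containers
    if executors <= 0:
        raise ValueError("need at least one executor container")
    return concurrent_tasks / executors


if __name__ == "__main__":
    # Figures quoted in the log: 33 containers, 128 tasks.
    print(cores_per_executor(33, 128))
```

With the figures from the log this gives 4 cores per executor container, matching the "4 VCPUs, not one" reading, while the scheduler UI reports one vcore per container.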
[12:57:50] <gmodena>	 I'm learning about the mediawiki.revision-create kafka topic. If I understand it correctly, those events do not contain the revision payload. Is there a canonical way for accessing the revision content? My use case is along the lines of: given a wikipedia article at times t_1 and t_2, I would like to diff t_2 and t_1 in a real-time data pipeline and check if "an image was added at t_2, which was not present at t_1". 
[12:59:52] <joal>	 gmodena: you'll need to get the content from the api for that
[13:02:24] <gmodena>	 joal the Action API? Is it the preferred access method also for internal use cases (potentially high throughput)?
[13:03:19] <joal>	 gmodena: The throughput needs to be discussed with your team I think :) And about getting content, I think the new API is the one you should use
[13:03:49] <gmodena>	 joal awesome sauce. Thanks for the pointer :)
[13:04:45] <joal>	 gmodena: this type of use case (getting the content of revisions and working with it in a streaming way) is appearing more and more - we should collaborate on providing streams of useful pre-computed content-related info - that would be awesome (in addition to moving in the direction of solving more use-cases :)
[13:06:40] <gmodena>	 joal i'd be happy to join forces! Right now I'm just playing around with a little spike for learning about kafka. Happy to touch base and bounce ideas around in our next chat :)
[13:07:12] <wikibugs>	 (CR) Ottomata: "> That seems like a really weird requirement." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/680798 (https://phabricator.wikimedia.org/T254891) (owner: Neil P. Quinn-WMF)
[13:52:43] <elaragon>	 Hi! I am trying to gather data about spambots that have been blocked globally. To do this, I am parsing the monthly pages of stewards requests (e.g. https://meta.wikimedia.org/wiki/Steward_requests/Global/2021-04) and then examining individually which of these users have been flagged as spambots (e.g., https://meta.wikimedia.org/wiki/Special:CentralAuth/AnonymousRebellion)... is there a table where I could 
[13:52:43] <elaragon>	 get this information directly from?
[14:40:17] <wikibugs>	 (PS8) Kosta Harlan: [WIP] Create structured_task/article/link_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177)
[14:44:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP] Create structured_task/article/link_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177) (owner: Kosta Harlan)
[14:47:09] <wikibugs>	 (PS9) Kosta Harlan: [WIP] Create structured_task/article/link_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177)
[14:48:11] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform, Product-Data-Infrastructure, Patch-For-Review: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set header values in event data - https://phabricator.wikimedia.org/T263466 (Ottomata) Open→...
[15:41:41] <wikibugs>	 (PS2) Hnowlan: Add docker-compose environment with cassandra [analytics/aqs] - https://gerrit.wikimedia.org/r/679295
[15:42:47] <elukey>	 hnowlan: <3
[15:43:42] <wikibugs>	 (CR) Kosta Harlan: "> Patch Set 7:" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177) (owner: Kosta Harlan)
[15:43:55] <hnowlan>	 :D
[15:44:07] <wikibugs>	 (PS10) Kosta Harlan: [WIP] Create structured_task/article/link_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177)
[15:44:19] <hnowlan>	 the above now works (for real I promise) and is ready for review if anyone wants to take it for a spin 
[15:47:09] <wikibugs>	 Analytics, Event-Platform: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata)
[15:47:21] <wikibugs>	 Analytics, Event-Platform: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata) a:Ottomata
[15:47:31] <wikibugs>	 Analytics, Event-Platform: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata)
[15:49:35] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Ottomata)
[15:49:38] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Ottomata)
[15:50:59] <wikibugs>	 (CR) Ottomata: [C: +1] [WIP] Create structured_task/article/link_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177) (owner: Kosta Harlan)
[15:54:42] <wikibugs>	 Analytics, Analytics-Kanban: Add logic to purging scripts that requires admin action if it's about to delete a lot of data - https://phabricator.wikimedia.org/T270433 (mforns) a:mforns
[15:55:20] <wikibugs>	 (PS1) Ottomata: Add WikipediaPortal to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/685513 (https://phabricator.wikimedia.org/T282012)
[16:03:07] <wikibugs>	 Analytics, Event-Platform, Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata) @mpopov do you know who maintains [[ https://gerrit.wikimedia.org/r/admin/repos/wikimedia%2Fportals | wikimedia/portals ]]?  It looks like it has a [[ https://g...
[16:04:09] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform, Product-Data-Infrastructure, and 2 others: VirtualPageView should use EventLogging api to send virtual page view events - https://phabricator.wikimedia.org/T279382 (mforns) Hi! I just realized that, when this change is deployed to prod, we'll be miss...
[16:11:34] <awight>	 mforns: Just to highlight my latest obstacle, https://phabricator.wikimedia.org/T273748#7051951
[16:11:49] <awight>	 Otherwise, the new metrics seem perfectly healthy!
[16:19:08] <wikibugs>	 (PS11) Kosta Harlan: [WIP] Create structured_task/article/link_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177)
[16:22:00] <wikibugs>	 (CR) Milimetric: [V: +2 C: +2] "My bad, this was all copy pasta from an hourly job, and I tried to fix most of it but keep finding things I missed.  Yaaaay oozie... :(" [analytics/refinery] - https://gerrit.wikimedia.org/r/685414 (owner: Joal)
[16:23:20] <wikibugs>	 (PS12) Kosta Harlan: [WIP] Create structured_task/article/link_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177)
[16:25:24] <wikibugs>	 (PS13) Kosta Harlan: [WIP] Create structured_task/article/link_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177)
[16:35:33] <joal>	 elukey: have you updated the number of tasks on hadoop or not yet?
[16:35:53] <wikibugs>	 (PS14) Kosta Harlan: [WIP] Create structured_task/article/link_suggestion_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177)
[16:36:37] <elukey>	 joal: not yet, still in code review
[16:36:55] <joal>	 ack elukey - it looks like my cassandra issue could be related
[16:37:16] <joal>	 There still is some behavior from cassandra driver I don't understand though
[16:42:00] <joal>	 elukey: https://issues.apache.org/jira/browse/YARN-9839 ?
[16:43:56] <elukey>	 joal: yeah I have it open as well, but I'd like to test the TasksMax first
[16:44:05] <joal>	 yeah
[16:44:12] <joal>	 elukey: The root cause of this issue was an OS level configuration which was not letting OS to overcommit virtual memory. 
[16:44:21] <joal>	 there is a feel of virtual-memory
[16:45:59] <elukey>	 joal: that was the problem of one single person reporting it, it may be it but we shouldn't trust that solution blindly
[16:46:10] <joal>	 yeah true
[16:51:19] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Migrate eventlog1002 to buster - https://phabricator.wikimedia.org/T278137 (hnowlan)
[16:57:17] <hnowlan>	 gonna start the decom of eventlog1002 
[16:59:50] <hnowlan>	 actually before I do - there is a large folder in /srv/home/nuria/T219842_kafka_jumbo_outage. Is that worth saving? cc ottomata elukey razzi 
[17:00:47] <joal>	 elukey: I wonder - shall we move to DominantResourceCalculator instead of DefaultResourceCalculator?
[17:01:22] <elukey>	 joal: this is a good question, maybe tomorrow with coffee ? :)
[17:01:35] <joal>	 sure elukey :)
[17:01:38] <elukey>	 <#
[17:01:40] <elukey>	 <3
[17:04:20] <wikibugs>	 (CR) Nettrom: "Responding to Gergő and Kosta's discussion on generalizing recommendations" (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/681052 (https://phabricator.wikimedia.org/T278177) (owner: Kosta Harlan)
[17:11:37] <Amir1>	 elukey: hey, for when you have time T281809 would be amazing
[17:11:39] <stashbot>	 T281809: Requesting a kerberos identity for user sihe - https://phabricator.wikimedia.org/T281809
[17:11:44] <Amir1>	 it's blocking a colleague 
[17:13:34] * joal is stuck in cassandra darkness again :(
[17:36:14] <razzi>	 !log create principal for sihe: sudo manage_principals.py create sihe --email_address=silvan.heintze@wikimedia.de
[17:36:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:37:26] <wikibugs>	 Analytics, Patch-For-Review: Requesting a kerberos identity for user sihe - https://phabricator.wikimedia.org/T281809 (razzi) a:razzi
[17:38:16] <wikibugs>	 Analytics, WMDE-Analytics-Engineering, WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (Merle_von_Wittich_WMDE) hey @GoranSMilovanovic  I am wondering if the raw data mentioned above is relevant for your old reports?
[17:41:19] <elukey>	 Amir1: I was in a meeting but razzi was faster :)
[17:41:37] <Amir1>	 Awesome. Thank you both!
[17:47:37] <wikibugs>	 Analytics, Patch-For-Review: Requesting a kerberos identity for user sihe - https://phabricator.wikimedia.org/T281809 (razzi) Should be all set; email was sent to silvan.heintze@wikimedia.de.
[17:48:37] * elukey afk!
[17:52:43] <wikibugs>	 Analytics, WMDE-Analytics-Engineering, WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (GoranSMilovanovic) @Merle_von_Wittich_WMDE I don't think so. All the datasets that we need to re-render the old reports in R markdown should stil...
[17:58:01] <wikibugs>	 Analytics, Analytics-Kanban, Event-Platform, Growth-Team, and 3 others: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (Ottomata) It's been a couple of weeks since I sent an email asking if anyone needed or used revert info in mediawiki.revision-creat...
[18:05:01] <wikibugs>	 Analytics, WMDE-Analytics-Engineering, WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (Ottomata) A related q as we are figuring this out.  Are these used at all?  If not, we would like to stop collecting them as part of {T259163}....
[18:05:45] <wikibugs>	 Analytics, SRE, Traffic, Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (CDanis)
[18:06:41] <wikibugs>	 Analytics, SRE, Traffic, Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (CDanis) @fdans @JAllemandou New map entry should be ready for Analytics to set up in Turnilo :)
[18:06:57] <wikibugs>	 Analytics, WMDE-Analytics-Engineering, WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (Ottomata) @gabriel-wmde  @CorinnaHillebrand_WMDE @Tim_WMDE
[18:10:27] <wikibugs>	 Analytics, Platform Engineering: AirFlow collaboration between PE and DE - https://phabricator.wikimedia.org/T282033 (Milimetric)
[18:10:46] <wikibugs>	 Analytics, Platform Engineering: AirFlow collaboration between PE and DE - https://phabricator.wikimedia.org/T282033 (Milimetric)
[18:10:49] <wikibugs>	 Analytics, Analytics-Kanban: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Milimetric)
[18:12:23] <milimetric>	 gmodena / clarakosi: I made this parent task: https://phabricator.wikimedia.org/T282033, sorry for the triple ping, just making sure you see it.  Feel free to add/change anything you like, I'll try and track any ongoing work there.
[18:15:27] <wikibugs>	 Analytics, Platform Engineering: Catalog, Categorize, and Templetize existing scheduled workflows - https://phabricator.wikimedia.org/T282035 (Milimetric)
[18:18:01] <wikibugs>	 Analytics: Requesting a kerberos identity for user sihe - https://phabricator.wikimedia.org/T281809 (razzi) Open→Resolved @Silvan_WMDE Read the user guide at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide and comment here or chat in #wikimedia-analytics on IRC if you run int...
[18:22:58] <wikibugs>	 Analytics-Clusters: Could not find class ::profile::swap for an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T281917 (razzi) Open→Resolved Ok, sure enough, the alert has removed an-test-client from its erroring nodes.
[18:23:41] <ottomata>	 fdans:  yt?
[18:23:49] <ottomata>	 is EL schema AutomatedRequest used?
[18:23:50] <ottomata>	 looks like no data
[18:23:52] <ottomata>	 and you created it
[18:23:52] <ottomata>	 :)
[18:23:58] <ottomata>	 https://meta.wikimedia.org/wiki/Schema:AutomatedRequest
[18:27:03] <fdans>	 wat
[18:27:18] <fdans>	 i created no such thing
[18:28:00] <ottomata>	 fdans: 
[18:28:00] <ottomata>	 https://meta.wikimedia.org/w/index.php?title=Schema:AutomatedRequest&action=history
[18:28:01] <ottomata>	 :)
[18:28:04] <fdans>	 oh I guess I did
[18:28:10] <fdans>	 I have no memory of this
[18:28:19] <ottomata>	 your memory sounds like it works like mine
[18:28:23] <ottomata>	 LRU purging
[18:28:37] <ottomata>	 ok, then i will mark it to decommission
[18:31:27] <fdans>	 I remember around that time I was doing the report on weird requests coming from middle east countries on IE11
[18:31:59] <fdans>	 so that's kinda related but I have no idea why would I create a new schema for anything like that
[18:32:05] <fdans>	 oh well
[18:33:10] <ottomata>	 awight: hi!
[18:33:13] <gmodena>	 milimetric terrific, thanks for this!
[18:33:16] <ottomata>	 is EditDebugging schema used at all?
[18:33:19] <ottomata>	 https://meta.wikimedia.org/w/index.php?title=Schema:EditDebugging&action=history
[18:33:29] <ottomata>	 you created it long ago, and it isn't receiving any traffic
[18:34:08] <ottomata>	 same q for EditLifecycle
[18:34:49] <wikibugs>	 Analytics, Platform Team Workboards (Image Suggestion API): AirFlow collaboration between PE and DE - https://phabricator.wikimedia.org/T282033 (Clarakosi)
[18:44:30] <wikibugs>	 Analytics, Platform Team Workboards (Image Suggestion API): AirFlow collaboration between PE and DE - https://phabricator.wikimedia.org/T282033 (Clarakosi)
[19:30:46] <ottomata>	 AndyRussG: yt?
[19:56:23] <wikibugs>	 Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Ottomata) Ok, I've started sorting through the long tail of schemas in the [[ https://docs.google.com/spreadsheets/...
[20:26:04] <wikibugs>	 Analytics, Product-Analytics: Add timestamps of important revision events to mediawiki_history - https://phabricator.wikimedia.org/T266375 (Isaac) @Ottomata thanks for the ping. Yeah, I'm aware of the table but the challenge has always been whether you can reconstruct the page restrictions on a page at a...
[20:27:45] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (srodlund) @elukey Awesome!!! I will take a pass at this tomorrow!
[20:30:16] <wikibugs>	 Analytics, Product-Analytics: Add timestamps of important revision events to mediawiki_history - https://phabricator.wikimedia.org/T266375 (Ottomata) Oh interesting.  Perhaps we should capture the expiry in that stream too!
[20:32:08] <wikibugs>	 Analytics, Product-Analytics: Default table creation settings results in warnings when querying - https://phabricator.wikimedia.org/T277822 (Milimetric) Open→Resolved a:Milimetric That particular warning seems to be gone, and what's left are the log4j warnings, which I'm looking into.  Please...
[20:38:54] <wikibugs>	 Analytics, Product-Analytics: Add timestamps of important revision events to mediawiki_history - https://phabricator.wikimedia.org/T266375 (Isaac) > Oh interesting. Perhaps we should capture the expiry in that stream too! Yeah, if it's straightforward, that'd be appreciated! I actually had a use-case for...
[20:54:04] <wikibugs>	 Analytics, Event-Platform, Platform Engineering: Add expiry info to mediawiki.page-restrictions-change stream - https://phabricator.wikimedia.org/T282057 (Ottomata)
[20:55:27] <isaacj>	 thanks ottomata !
[21:12:54] <wikibugs>	 Analytics, Analytics-Kanban, Event-Platform, Growth-Team, and 3 others: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (RBrounley_WMF) @Protsack.stephan - do we use the reverts in revision-create? sorry i didn't respond to your email, I actually had f...
[21:36:53] <awight>	 ottomata: Thanks for the ping—those two schemas can be removed with great haste :-).  I started to introduce them as a volunteer-time thing, back before we had introduced similar events which did this better.  I don't believe any of my experimental patches were ever merged.
[21:40:09] <awight>	 Unrelatedly, can someone with an-runner access let us know whether this job is throwing any errors?  We don't know why it seems to be failing: https://phabricator.wikimedia.org/T273748#7051951