[00:14:42] RECOVERY - Check the last execution of camus-mediawiki_job on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_job https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:27:26] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:40:50] PROBLEM - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:53:28] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:34:56] RECOVERY - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:36:48] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:11:04] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:18:12] PROBLEM - Check the last execution of monitor_refine_mediawiki_job_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:20:00] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:39:26] PROBLEM - Check the last execution of camus-event_dynamic_stream_configs on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-event_dynamic_stream_configs https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:12:22] PROBLEM - Check the last execution of monitor_refine_mediawiki_job_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:14:08] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:14:32] PROBLEM - Check the last execution of camus-mediawiki_job on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_job https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:22:28] PROBLEM - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_delayed on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:27:58] PROBLEM - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:51:21] 10Quarry, 10Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (10Marostegui) Thank you for your understanding
[04:56:43] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ammarpad) HTTP Error 400: Bad Request because the url is malformed. It's mixing revid with a page it does...
[05:07:22] PROBLEM - Check the last execution of analytics-dumps-fetch-mediawiki_history_dumps on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediawiki_history_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:11:12] PROBLEM - Check the last execution of analytics-dumps-fetch-mediawiki_history_dumps on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediawiki_history_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:15:02] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:22:02] PROBLEM - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:22:08] PROBLEM - Check the last execution of eventlogging_to_druid_netflow_daily on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:22:16] PROBLEM - Check the last execution of monitor_refine_mediawiki_job_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:27:40] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:52:58] goooood morning
[05:52:59] sigh
[05:59:45] ah snap the OOM killer
[05:59:49] oooofff
[06:00:08] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:03:45] !log kill all airflow-related processes on an-launcher1001 - host killing tasks due to OOM
[06:03:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:12:46] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:13:36] RECOVERY - Check the last execution of camus-mediawiki_job on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_job https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:15:06] so the biggest offenders are RU jobs
[06:15:44] RECOVERY - Check the last execution of camus-event_dynamic_stream_configs on an-launcher1001 is OK: OK: Status of the systemd unit camus-event_dynamic_stream_configs https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:16:12] RECOVERY - Check the last execution of eventlogging_to_druid_netflow_daily on an-launcher1001 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:22:43] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) p:05Triage→03High
[06:22:45] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) In theory it should be very simple: https://wikitech.wikimedia.org/wiki/Ganeti#Increase/Decrease_CPU/RAM
[06:23:36] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:28:48] !log temporary stop of all RU jobs on an-launcher1001 to privilege camus and others
[06:28:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:44:40] memory consumption went down a lot, and also the pagein/pageout metrics
[06:44:57] I am wondering if we should create an-launcher1002 only for RU
[06:45:11] but IIRC it will go away, so maybe only adding some memory will suffice
[06:45:22] but of course we have some constraints in ganeti now
[06:53:44] !log re-run virtualpageview-hourly-wf-2020-5-31-19
[06:53:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:04:23] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-launcher1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=now-3h&to=now
[07:05:16] I checked all the alarms and I think we are good, a lot of spam due to things getting stuck
[07:05:27] but please re-check if you have time so we are sure
[07:07:43] brb
[08:22:05] good morning o/
[08:23:57] I have 3 tasks stuck in spark application, is it possible to manually kill them (so that they can restart) without causing the whole job to fail? these are 3 out of 57000, the rest succeeded.
[08:25:04] good morning, never done it so no idea, usually we kill the spark yarn app
[08:33:02] 10Analytics, 10Growth-Team, 10GrowthExperiments, 10Product-Analytics: Homepage: ensure data retention is in line with the guideline exception - https://phabricator.wikimedia.org/T235577 (10Aklapper) @nettrom_WMF: The `Due Date` set for this open task is four months ago. Can you please either update or rese...
[08:46:50] elukey: Is it possible to login to the worker node and kill the process ?
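For reference, each of the Icinga checks above simply mirrors the state of a systemd unit on an-launcher1001. A minimal sketch of how one of the failed units might be inspected and cleared by hand, using standard systemd tooling (the unit name is taken from the alerts above; this is an illustration, not a record of the exact commands run during the incident):

    # Why did the last run fail?
    sudo systemctl status monitor_refine_eventlogging_analytics.service
    sudo journalctl -u monitor_refine_eventlogging_analytics.service --since "2 hours ago"

    # When is the associated timer scheduled to fire again?
    sudo systemctl list-timers | grep monitor_refine_eventlogging_analytics

    # Once the underlying problem is fixed, clear the failed state (what the Icinga
    # check looks at) or re-run the unit immediately instead of waiting for the timer
    sudo systemctl reset-failed monitor_refine_eventlogging_analytics.service
    sudo systemctl start monitor_refine_eventlogging_analytics.service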
[08:48:00] djellel: this is not a clean process, the analytics team can do it but usually it is better to stop/start the job cleanly
[08:49:13] elukey: I understand, that's what I would do in a normal case, but I am trying to save a 15h job that sifted through 8TB :/
[08:49:37] okok, do you know the workers ?
[08:50:31] hosts: 1064 1072 1089
[08:51:03] Ids: 69750 100653 105326
[08:51:17] application_1589903254658_45554
[08:51:46] ok on 1064 I have two
[08:52:03] two containers I mean
[08:52:13] do you want to kill them both or only a specific one?
[08:52:20] the one that has a "69750"
[08:52:54] mm I have
[08:52:55] container_e11_1589903254658_45554_01_000215
[08:53:52] ah maybe they are related, checking
[08:54:55] ok done
[08:55:07] djellel: can you check if 1064 is unblocked?
[08:55:12] if so I'll proceed with the rest
[08:56:49] joal: https://grafana.wikimedia.org/d/000000585/hadoop?panelId=25&fullscreen&var-hadoop_cluster=analytics-hadoop&orgId=1&from=now-14d&to=now :(
[08:56:56] bonjour (when you are online)
[08:56:56] looks like it did the trick
[08:57:18] djellel: okok 5 euros fee for each host, shall I proceed with the rest? :D
[08:57:55] lol, that's higher than a carousel ticket
[08:57:59] ahahha
[08:58:05] 1072 should be unblocked as well
[08:59:00] and also 89
[09:00:31] djellel: all ok?
[09:00:55] elukey: I don't know if my job will succeed, but FYI the tasks relaunched !! thank you so much.
[09:06:49] djellel: super, glad that worked
[09:48:10] RECOVERY - Check the last execution of analytics-dumps-fetch-mediawiki_history_dumps on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-mediawiki_history_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:48:51] \o/
[09:49:18] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10MarcoAurelio) This is a trace from Logstash-Beta I spotted and reported here. I did not insert anything....
[09:52:36] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: hdfs-rsync of mediawiki history dumps fails due to source not present (yet) - https://phabricator.wikimedia.org/T251858 (10elukey) Seems to work! ` Jun 01 09:51:22 labstore1006 kerberos-run-command[12033]: Ignoring missing hdfs source hdfs:///wmf/data/arc...
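A rough sketch of the manual unblocking described above. The application and container IDs are the ones quoted in the conversation; the exact commands are an assumption, since only the outcome is discussed here. Killing a single container lets YARN retry just that task attempt, whereas killing the whole application (the usual, cleaner option mentioned by elukey) would abort the entire 15h job:

    # On the NodeManager hosting the stuck task (e.g. analytics1064),
    # find the JVM of the suspect container for application_1589903254658_45554
    ps -ef | grep container_e11_1589903254658_45554_01_000215

    # Kill only that container; the Spark application keeps running and the
    # task attempt is relaunched elsewhere
    sudo kill <pid-from-the-ps-output>

    # The blunt alternative, which would fail the whole application:
    # yarn application -kill application_1589903254658_45554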
[09:53:07] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: hdfs-rsync of mediawiki history dumps fails due to source not present (yet) - https://phabricator.wikimedia.org/T251858 (10elukey)
[09:58:28] RECOVERY - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:59:52] RECOVERY - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:59:58] good
[10:02:52] RECOVERY - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_delayed on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:07:16] PROBLEM - Check the last execution of check_webrequest_partitions on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:09:49] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) This is the current status for eqiad: ` elukey@ganeti1003:~$ sudo gnt-node list Node DTotal DFree MTotal MNode MFree Pinst Sinst ganeti1001.eqiad.wmnet 707.4G 9....
[10:12:06] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:12:58] yes the check_webrequest_partitions complains since we are behind schedule
[10:13:06] but webrequest is being processed
[10:13:33] (well text is behind schedule, upload already caught up)
[10:40:49] * elukey lunch!
[10:55:01] 10Analytics-Kanban, 10Trash: --- DISCUSSED BELOW --- - https://phabricator.wikimedia.org/T114124 (10Aklapper) @Milimetric: If this is still actively used? If yes then this should really be done via two separate columns instead of this non-obvious error-prone concept.
[11:54:13] Hi elukey - normally I'm off today, but I'm gonna spend some time helping with the errors and all - what a mess !@
[12:39:12] joal: hola! No problem, all good, please don't work today :)
[12:39:16] everything seems under control
[12:39:50] elukey: from me chiming in, issue was created by airflow taking too much mem on an-launcher1001, right?
[12:40:51] joal: initially I thought it was airflow, but it was RU
[12:41:05] really? I'd be interested to know which query
[12:41:05] I mean the bulk of the problem was RU
[12:41:10] ack
[12:41:38] elukey: it looks like CPU usage on an-launcher has been steady for the past 2/3 days at ~50%, which was not the case before - Something must have changed
[12:41:51] that is airflow
[12:42:02] but the issue was memory consumption in this case
[12:42:03] Ah ok :)
[12:42:14] ok :)
[12:42:18] yes yes I noticed that some days ago and forgot to follow up with Marcel
[12:42:41] no problemo, I was trying to make sense of what happens in my head :)
[12:43:05] yes sorry I made a horrible summary probably
[12:43:07] elukey: Now about memory, this is bizarre nonetheless - possibly some hive optimization (local-joins) happening on the worker?
[12:43:23] could be yes
[12:43:37] elukey: I was not complaining about any summary, I was trying to understand :) you fixed, summary will come after :)
[12:44:45] also joal we are using 2PB on hdfs :(
[12:44:54] I've seen that elukey
[12:45:31] elukey: I'm part of the problem (wikitext fixing, so more data) - But something else is happening that I don't know - I'm investigating
[12:45:48] elukey: growing ~70Tb in 1h is not that frequent
[12:46:09] joal: you are on holidays, we can manage, please remove your hands from the keyboard :)
[12:46:23] (you tell me the same when I do it!)
[12:46:47] true
[12:47:43] ok, will drop elukey - I still think investigation is needed about the hdfs used-space growth :)
[12:48:13] joal: yep will do it with mforns when he comes online, I promise :)
[12:48:14] <3
[12:48:43] fyi, I've been trying to process wikitext, unsuccessfully.., maybe this generates some intermediary files?
[12:49:22] djellel: hi - it might be related yes - I don't think it would be intermediate files, as those are not on hdfs
[12:49:31] djellel: your home is quite big, I was wondering if it was intended or not
[12:50:35] djellel: it is around 7TB (that become 21 replicated 3 times)
[12:50:54] elukey, djellel : this is due to .Trash (just checked)
[12:51:25] 6.1 T 18.3 T /user/dedcode/.Trash
[12:51:28] nice L)
[12:51:29] :)
[12:51:33] djellel: can I delete?
[12:51:59] (trash is where files go when you delete them without -skipTrash, and they stay there for a month)
[12:52:09] (prevents accidental oooops)
[12:52:14] ok I get it now
[12:52:23] there is that trash (making 20T)
[12:52:56] Plus /user/hive/warehouse/dedcode.db (df2,enwiki_history_part_year,enwiki_history_part_year2
[12:53:23] That sums to ~75Tb - We have it all
[12:53:28] yes, nuke it :)
[12:53:33] the timestamps match
[12:54:04] ok now we know - Only to decide what to do - Gone for now ;)
[12:54:35] !log /user/dedcode/.Trash/* -skipTrash
[12:54:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:55:46] ok now I see
[12:55:46] 6.1 T 18.3 T /user/hive/warehouse/dedcode.db/df2
[12:55:46] 6.1 T 18.3 T /user/hive/warehouse/dedcode.db/enwiki_history_part_year
[12:55:49] 6.1 T 18.3 T /user/hive/warehouse/dedcode.db/enwiki_history_part_year2
[12:55:52] djellel: --^
[12:56:07] just before leaving djellel - It looks like diffs take a lot more space than text itself - brute force without some heuristic won't make it I think :)
[12:56:22] elukey: same, failed jobs
[12:56:45] joal: don't worry, I am not using the diffmathpatch
[12:57:21] it's a lot smaller, for some reason, processing the whole thing gets to 99.99% and then fails
[12:57:45] djellel: I'd ask you to nuke the dbs that you don't use if possible, dropping the data etc.. it is quicker and you have more context
[12:58:18] joal: lets talk tomorrow if you have time, I feel it's a matter of tuning the job. enjoy your day for now :)
[12:58:42] elukey: I actually ran a drop table from hive :/
[13:00:38] elukey: dfs -rm ?
[13:02:39] djellel: if you dropped the tables from hive yes, you can remove them with -rm -r -skipTrash etc..
[13:02:50] hdfs dfs -rm -r -skipTrash
[13:03:51] checking your db
[13:04:00] also elukey, just checked sqoop - seems all good this month ;)
[13:04:13] joal: \o/
[13:05:42] elukey: done
[13:05:57] ok gone for real - will check later tonight
[13:09:23] djellel: thanks a lot!
[13:18:34] elukey: o/ am filling out quarter estimates in the hw budget sheet
[13:18:46] i filled out a couple but you might have better guesses on some than me
[13:22:31] ottomata: o/ I had a chat with Faidon this morning, perfect timing since he's checking the sheet in these days, the sooner we have a final version the better
[13:23:52] ok great
[13:24:07] lets fill in some guesses real quick then?
[13:24:16] expansion after bigtop
[13:24:26] don't remember exactly when we are trying to do bigtop?
[13:24:51] not before Q2 I think
[13:25:26] so Q2/Q3 could be good
[13:26:01] ok, let's say Q3
[13:26:12] cassandra refresh?
[13:26:55] hello elukey, do you have any thoughts what might have happened to netflow data in this interval? https://w.wiki/SVo
[13:27:13] 31 May, 18:30-19:00 exactly, 0 data points
[13:27:56] I've checked one nfacctd and it was still sending stuff to kafka in this interval
[13:32:04] ottomata: Q2 could be doable
[13:32:29] k!
[13:32:31] thorium replacement?
[13:32:43] cdanis: weird, I am wondering if data is in hive/hdfs
[13:33:37] i see netflow data in hive for hour=29 at least
[13:33:42] ah nice
[13:33:42] and hour=18
[13:34:06] and 17
[13:34:11] cdanis: how do you consume the data?
[13:34:15] not in hive?
[13:34:19] ottomata: for thorium.. Q2?
[13:34:32] k!
[13:34:35] ottomata: mostly from Druid via turnilo
[13:34:41] as far as I know
[13:34:41] druid100[1-3] refresh?
[13:34:42] ah
[13:34:47] is that realtime into druid?
[13:34:50] or via hive?
[13:34:56] both :)
[13:35:02] oh right ok
[13:35:03] hmmm
[13:35:13] so at the very least the hive part should replace if anything is missing
[13:35:16] not sure when that runs
[13:35:22] ahh wait there was a refine error for netflow IIRC
[13:35:59] but that was for the 29th
[13:36:01] nevermind
[13:37:00] ottomata: re druid100[1-3], since we can reimage anytime, I'd say that Q1/Q2 is fine
[13:37:15] we'll have to be careful in moving zookeeper to new nodes
[13:37:23] but should be easy
[13:37:37] but it can wait Q3
[13:37:49] otherwise too many things in Q1/Q2
[13:37:53] ottomata: does it make sense?
[13:38:03] 10Analytics, 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10Dzahn) Hi, gentle prod. Any updates on this? It has been open for a while now.
[13:38:30] cdanis: can you open a task if you have time? Today it has been a bit problematic for us, a lot of things exploding, if this is not urgent I'll try to check tomorrow
[13:42:28] ciao analytics :) i hope you are doing well! i have a question! how often do we update hive tables like mediawiki_imagelinks in wmf_raw? It seems that the last update is 12/2019. Is there a place where I can find a more recently updated table?
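Back on the HDFS clean-up from earlier: a minimal sketch of the kind of commands involved, using the paths quoted above (an illustration of the technique, not a transcript of what was actually run):

    # Per-directory usage: raw size, then size including the 3x replication
    hdfs dfs -du -h /user/dedcode
    hdfs dfs -du -h /user/hive/warehouse/dedcode.db

    # Empty the trash immediately instead of waiting for the 1-month retention
    hdfs dfs -rm -r -skipTrash '/user/dedcode/.Trash/*'

    # If a data directory is still there after DROP TABLE in Hive (as above),
    # remove it explicitly as well
    hdfs dfs -rm -r -skipTrash /user/hive/warehouse/dedcode.db/df2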
[13:50:49] elukey: sure, no urgency, will file a task
[13:52:18] 10Analytics, 10Operations, 10Traffic: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10CDanis)
[14:22:23] heya teamm
[14:25:30] hellOooO
[14:31:29] elukey: cool ya Q3 is good
[14:31:40] how about dedicated superset/turnilo?
[14:31:49] and RAM for Hadoop masters?
[14:32:16] miriam: i don't know about imagelinks, maybe mforns knows or knows who does?
[14:33:35] ottomata: Q2 should be fine, we might need it sooner rather than later if people's usage of superset ramps up dramatically
[14:35:47] 10Analytics, 10Event-Platform, 10Operations, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Ottomata) FYI, we've recently added a 'general.yaml' values support to our helm charts repo. This allows us to render values from puppet. I'd like to accom...
[14:36:31] elukey: and RAM for masters?
[14:36:37] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10akosiaris) >>! In T254125#6181321, @elukey wrote: > This is the current status for eqiad: > > ` > elukey@ganeti1003:~$ sudo gnt-node list > Node DTotal DFree MTotal MNode...
[14:38:34] thanks ottomata!
[14:38:45] 10Analytics, 10Event-Platform, 10Operations, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Pchelolo) cc @hnowlan let's see what @Ottomata gets to and see if we can incorporate it into changeprop
[14:43:25] ottomata: Q1, quick and easy
[14:44:32] great ty!
[14:47:56] 10Analytics, 10Analytics-Kanban, 10Analytics-SWAP, 10Product-Analytics, 10User-Elukey: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (10mpopov) >>! In T247752#6179982, @nshahquinn-wmf wrote: > For some reason, this didn't work for until I put the command...
[14:53:42] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) ` elukey@ganeti1003:~$ sudo gnt-instance modify -B memory=12g an-launcher1001.eqiad.wmnet Modified instance an-launcher1001.eqiad.wmnet - be/memory -> 12288 Please don't forget that...
[14:54:48] !log stop all timers on an-launcher1001, prep step for reboot
[14:54:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:59:01] ah nope! sqoop is running
[15:19:18] PROBLEM - Check the last execution of reportupdater-pingback on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-pingback https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:22:25] this was me --^
[16:13:20] RECOVERY - Check the last execution of reportupdater-pingback on an-launcher1001 is OK: OK: Status of the systemd unit reportupdater-pingback https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:18:03] 10Analytics, 10Product-Analytics (Kanban): Create Druid tables for Druid datasources in Superset - https://phabricator.wikimedia.org/T251857 (10cchen) 05Open→03Resolved Close the ticket since there's no further updates to this task.
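On the missing netflow data (now tracked as T254161 above): a hedged sketch of how the Hive side of the pipeline could be checked for the gap. The Druid datasource is wmf_netflow; the assumption here is that the refined Hive table is wmf.netflow with the usual year/month/day/hour partitions — adjust the names if they differ:

    # Count refined rows per hour around the 18:30-19:00 gap on 2020-05-31
    hive -e "
      SELECT hour, COUNT(*) AS events
      FROM wmf.netflow
      WHERE year = 2020 AND month = 5 AND day = 31 AND hour BETWEEN 17 AND 19
      GROUP BY hour
      ORDER BY hour;
    "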
[16:18:05] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Experiment with Druid and SqlAlchemy - https://phabricator.wikimedia.org/T249681 (10cchen)
[16:35:35] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ottomata) This is due to a change in {T249261}. I will look into it!
[16:43:17] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ottomata) a:03Ottomata
[16:45:59] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ottomata) Although, I'm not sure what would be trying to reach searchsatisfaction on meta. The logstash-...
[17:03:55] milimetric: is miriam's problem related to the change in field type in the sqooping of imagelinks?
[17:17:57] thanks mforns! millimetric: hi!! Copying the message here, so you don't have to scroll back! Question: how often do we update hive tables like mediawiki_imagelinks in wmf_raw? It seems that the last update is 12/2019. Is there a place where I can find a more recently updated table?
[17:18:15] thanks mforns! milimetric: hi!! Copying the message here, so you don't have to scroll back! Question: how often do we update hive tables like mediawiki_imagelinks in wmf_raw? It seems that the last update is 12/2019. Is there a place where I can find a more recently updated table?
[17:20:51] miriam: I think imagelinks is sqooped on an as-needed basis right now, I did it in December for our data party but we didn’t have anyone asking for regular sqoops. The other tables that are part of the schedule are sqooped monthly.
[17:21:41] so if you need a recent version, I can sqoop one :)
[17:22:05] but not right right now ‘cause all the other big first-of-the-month jobs are running
[17:24:39] ooh I see milimetric, thanks!
[17:25:46] if this is easy to do, we would need this for the intern project, so anytime this week or the next would work, if possible?
[17:26:10] milimetric - is it better if I open a task?
[17:26:22] * elukey off!
[17:26:47] miriam: up to you, sure, do you need it regularly?
[17:27:46] milimetric: for this intern project, a one-off update would be Ok. But I expect in the future we might need a more regular update
[17:28:45] 10Analytics, 10Growth-Team, 10GrowthExperiments, 10Product-Analytics: Homepage: ensure data retention is in line with the guideline exception - https://phabricator.wikimedia.org/T235577 (10nettrom_WMF) 05Open→03Resolved a:03nettrom_WMF This work was done in T243557, closing as resolved. Thanks for th...
[17:28:48] 10Analytics, 10GrowthExperiments, 10Product-Analytics, 10Growth-Team (Current Sprint): Homepage: instrumentation - https://phabricator.wikimedia.org/T216586 (10nettrom_WMF)
[17:30:20] milimetric: so I don't have a sense of the amount of work needed to schedule regular updates. If it isn't, let's do it! Otherwise, a one-off sqooping of a recent version would work for now.
[17:30:46] if *it's not too much work
[17:33:02] both easy, just take up space. I’ll turn on regular updates and I’ll come back to you if we need to scale back later
[17:33:35] fantastic milimetric, many many thanks!!
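To answer the "is there a more recent snapshot?" question directly, the snapshots already registered for the sqooped table can be listed from the Hive metastore; a small sketch (assuming the table is partitioned by snapshot and wiki_db, as the other wmf_raw mediawiki tables are):

    # One line per (snapshot, wiki) partition; the most recent snapshot values
    # are the ones worth querying
    hive -e "SHOW PARTITIONS wmf_raw.mediawiki_imagelinks;"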
[17:33:51] np, easy peasy
[18:08:09] 10Analytics: Decomission notebook hosts - https://phabricator.wikimedia.org/T249752 (10jwang) I have moved my stuffs off the old clients. Thanks.
[18:09:21] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 (10Ottomata) test.wikipeda.org is now successfully sending SearchSatisfaction events via Ev...
[18:11:02] milimetric: image links should be sqooped regularly, i remember adding it to the sqoop job
[18:11:50] nuria yeah I saw that, Marcel’s looking into why it’s not
[18:14:32] milimetric: i see, there is a more recent version on my database on hive as i sqooped it 2020-03
[18:15:04] milimetric: i think miriam can probably use that one
[18:16:48] nuria: milimetric: I was looking into that as ops week, and saw that the sqoop worked, but there are missing partitions in the metastore
[18:17:10] mforns: missing for the past few months?
[18:17:12] I msck repaired the table and it now has data
[18:17:17] mforns: i see
[18:17:28] hm, I see it was added in April by Joseph, to the sqoop list. Something wrong with the load oozie not knowing about it mforns?
[18:17:30] yes, missing for the past few months, but not only that table, but I think all of them
[18:17:56] mforns: i thought that happened as part of the sqooping job (running msck)
[18:18:26] I don't know, looking
[18:31:32] no, it's not on all tables, I was confused, was trying to access data from 2020-05 snapshot, for which the sqoop is still running, so Oozie's mediawiki-history-load (msck) not triggered yet.
[18:32:00] mforns: k, on meeting, let's talk in a bit
[18:32:16] k
[18:44:05] 10Analytics, 10Analytics-Kanban: Table wmf_raw.mediawiki_imagelinks seems to be missing data - https://phabricator.wikimedia.org/T254188 (10mforns)
[18:44:12] 10Analytics, 10Product-Analytics: /srv/published should be structured similarly, have identical README across stat hosts describing said structure - https://phabricator.wikimedia.org/T254189 (10mpopov)
[18:45:02] 10Analytics, 10Analytics-Kanban: Table wmf_raw.mediawiki_imagelinks seems to be missing data - https://phabricator.wikimedia.org/T254188 (10mforns) The mediawiki-history-load oozie workflow has a typo, which skips that particular table. fixing.
[18:45:30] 10Analytics, 10Product-Analytics: /srv/published should be structured similarly, have identical README across stat hosts describing said structure - https://phabricator.wikimedia.org/T254189 (10mpopov)
[18:46:02] (03PS1) 10Mforns: Fix typo that skips imagelinks in mediawiki-history-load [analytics/refinery] - 10https://gerrit.wikimedia.org/r/601387 (https://phabricator.wikimedia.org/T254188)
[19:08:16] 10Analytics, 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10mpopov) I've reached out to @spatton on Slack about this.
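The manual fix mentioned above ("I msck repaired the table") registers partitions that already exist on HDFS but are missing from the Hive metastore — the step the mediawiki-history-load Oozie job normally performs once sqoop finishes. A minimal sketch:

    # Sync metastore partitions with the directories sqoop wrote to HDFS;
    # re-running the SHOW PARTITIONS check from the earlier sketch should then
    # list the new snapshot
    hive -e "MSCK REPAIR TABLE wmf_raw.mediawiki_imagelinks;"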
[19:53:41] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10JAllemandou)
[19:53:52] quick update on imagelink table milimetric and miriam: data is present since last month, but the hive table is not updated - I created T254191 to make sure we don't forget
[19:53:52] T254191: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191
[19:55:55] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10Miriam) Thanks for this @JAllemandou !
[19:58:00] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10JAllemandou) Actually I should have checked before creating this :) The table is already added to the job, and last full available snapshot is `2020-04`...
[19:58:10] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10JAllemandou) 05Open→03Invalid
[20:10:22] (03CR) 10Nuria: [C: 03+2] Fix typo that skips imagelinks in mediawiki-history-load [analytics/refinery] - 10https://gerrit.wikimedia.org/r/601387 (https://phabricator.wikimedia.org/T254188) (owner: 10Mforns)
[20:26:32] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10Miriam) @JAllemandou thanks. However, if I query the mediawiki_imagelinks table in wmf_raw for a random page, I get "2019-12" in the snapshot field, and...
[20:36:27] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10Miriam) Ignore the message above, I ran the queries again, and it indeed seems that the problem has been solved in the past few hours :) thanks so much!
[20:54:19] 10Analytics, 10Discovery, 10Operations, 10Recommendation-API, 10Patch-For-Review: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10dpifke) a:03dpifke
[20:58:32] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10MarcoAurelio) >>! In T254058#6182454, @Ottomata wrote: > Although, I'm not sure what would be trying to r...
[21:01:04] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ottomata) Ah! I see. This is eventlogging-processor trying to parse an event that it shouldn't. https:/...
[21:37:34] (03CR) 10Nuria: [V: 03+2 C: 03+2] Fix typo that skips imagelinks in mediawiki-history-load [analytics/refinery] - 10https://gerrit.wikimedia.org/r/601387 (https://phabricator.wikimedia.org/T254188) (owner: 10Mforns)
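A closing note on the snapshot confusion above: queries against the sqooped tables should pin the snapshot partition explicitly, otherwise a scan can return rows from an older snapshot such as 2019-12. A hedged sketch using the 2020-04 snapshot mentioned above (the il_* column names follow the MediaWiki imagelinks schema, and wiki_db as a partition column is an assumption based on the usual wmf_raw layout):

    # Read from one specific monthly snapshot and one wiki only
    hive -e "
      SELECT il_from, il_to
      FROM wmf_raw.mediawiki_imagelinks
      WHERE snapshot = '2020-04' AND wiki_db = 'enwiki'
      LIMIT 10;
    "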