[00:14:42] RECOVERY - Check the last execution of camus-mediawiki_job on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_job https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:27:26] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:40:50] PROBLEM - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:53:28] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:34:56] RECOVERY - Check the last execution of reportupdater-published_cx2_translations_mysql on an-launcher1001 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:36:48] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:11:04] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:18:12] PROBLEM - Check the last execution of monitor_refine_mediawiki_job_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:20:00] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:39:26] PROBLEM - Check the last execution of camus-event_dynamic_stream_configs on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-event_dynamic_stream_configs https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:12:22] PROBLEM - Check the last execution of monitor_refine_mediawiki_job_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:14:08] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:14:32] PROBLEM - Check the last execution of camus-mediawiki_job on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_job https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:22:28] PROBLEM - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_delayed on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:27:58] PROBLEM - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:51:21] 10Quarry, 10Data-Services: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (10Marostegui) Thank you for your understanding
[04:56:43] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ammarpad) HTTP Error 400: Bad Request because the url is malformed. It's mixing revid with a page it does...
[05:07:22] PROBLEM - Check the last execution of analytics-dumps-fetch-mediawiki_history_dumps on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediawiki_history_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:11:12] PROBLEM - Check the last execution of analytics-dumps-fetch-mediawiki_history_dumps on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediawiki_history_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:15:02] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:22:02] PROBLEM - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:22:08] PROBLEM - Check the last execution of eventlogging_to_druid_netflow_daily on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:22:16] PROBLEM - Check the last execution of monitor_refine_mediawiki_job_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:27:40] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:52:58] goooood morning
[05:52:59] sigh
[05:59:45] ah snap the OOM killer
[05:59:49] oooofff
[06:00:08] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:03:45] !log kill all airflow-related processes on an-launcher1001 - host killing tasks due to OOM
[06:03:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:12:46] PROBLEM - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:13:36] RECOVERY - Check the last execution of camus-mediawiki_job on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_job https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:15:06] so the biggest offenders are RU jobs
[06:15:44] RECOVERY - Check the last execution of camus-event_dynamic_stream_configs on an-launcher1001 is OK: OK: Status of the systemd unit camus-event_dynamic_stream_configs https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:16:12] RECOVERY - Check the last execution of eventlogging_to_druid_netflow_daily on an-launcher1001 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:22:43] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) p:05Triage→03High
[06:22:45] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) In theory it should be very simple: https://wikitech.wikimedia.org/wiki/Ganeti#Increase/Decrease_CPU/RAM
[06:23:36] RECOVERY - Check the last execution of camus-mediawiki_analytics_events on an-launcher1001 is OK: OK: Status of the systemd unit camus-mediawiki_analytics_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:28:48] !log temporary stop of all RU jobs on an-launcher1001 to privilege camus and others
[06:28:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:44:40] memory consumption went down a lot, and also the pagein/pageout metrics
[06:44:57] I am wondering if we should create an-launcher1002 only for RU
[06:45:11] but IIRC it will go away, so maybe only adding some memory will suffice
[06:45:22] but of course we have some constraints in ganeti now
[06:53:44] !log re-run virtualpageview-hourly-wf-2020-5-31-19
[06:53:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:04:23] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-launcher1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=now-3h&to=now
[07:05:16] I checked all the alarms and I think we are good, a lot of spam due to things getting stuck
[07:05:27] but please re-check if you have time so we are sure
[07:07:43] brb
[08:22:05] good morning o/
[08:23:57] I have 3 tasks stuck in spark application, is it possible to manually kill them (so that they can restart) without causing the whole job to fail? these are 3 out of 57000, the rest succeeded.
[08:25:04] good morning, never done it so no idea, usually we kill the spark yarn app
[08:33:02] 10Analytics, 10Growth-Team, 10GrowthExperiments, 10Product-Analytics: Homepage: ensure data retention is in line with the guideline exception - https://phabricator.wikimedia.org/T235577 (10Aklapper) @nettrom_WMF: The `Due Date` set for this open task is four months ago. Can you please either update or rese...
[08:46:50] elukey: Is it possible to login to the worker node and kill the process ?
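For reference, each of the Icinga checks above simply mirrors the state of a systemd unit on an-launcher1001. A minimal sketch of how one of the failed units might be inspected and cleared by hand, using standard systemd tooling (the unit name is taken from the alerts above; this is an illustration, not a record of the exact commands run during the incident):

    # Why did the last run fail?
    sudo systemctl status monitor_refine_eventlogging_analytics.service
    sudo journalctl -u monitor_refine_eventlogging_analytics.service --since "2 hours ago"

    # When is the associated timer scheduled to fire again?
    sudo systemctl list-timers | grep monitor_refine_eventlogging_analytics

    # Once the underlying problem is fixed, clear the failed state (what the Icinga
    # check looks at) or re-run the unit immediately instead of waiting for the timer
    sudo systemctl reset-failed monitor_refine_eventlogging_analytics.service
    sudo systemctl start monitor_refine_eventlogging_analytics.service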
[08:48:00] djellel: this is not a clean process, the analytics team can do it but usually it is better to stop/start the job cleanly
[08:49:13] elukey: I understand, that's what I would do in a normal case, but I am trying to save a 15h job that sifted through 8TB :/
[08:49:37] okok, do you know the workers ?
[08:50:31] hosts: 1064 1072 1089
[08:51:03] Ids: 69750 100653 105326
[08:51:17] application_1589903254658_45554
[08:51:46] ok on 1064 I have two
[08:52:03] two containers I mean
[08:52:13] do you want to kill them both or only a specific one?
[08:52:20] the one that has a "69750"
[08:52:54] mm I have
[08:52:55] container_e11_1589903254658_45554_01_000215
[08:53:52] ah maybe they are related, checking
[08:54:55] ok done
[08:55:07] djellel: can you check if 1064 is unblocked?
[08:55:12] if so I'll proceed with the rest
[08:56:49] joal: https://grafana.wikimedia.org/d/000000585/hadoop?panelId=25&fullscreen&var-hadoop_cluster=analytics-hadoop&orgId=1&from=now-14d&to=now :(
[08:56:56] bonjour (when you are online)
[08:56:56] looks like it did the trick
[08:57:18] djellel: okok 5 euros fee for each host, shall I proceed with the rest? :D
[08:57:55] lol, that's higher than a carousel ticket
[08:57:59] ahahha
[08:58:05] 1072 should be unblocked as well
[08:59:00] and also 89
[09:00:31] djellel: all ok?
[09:00:55] elukey: I don't know if my job will succeed, but FYI the tasks relaunched !! thank you so much.
[09:06:49] djellel: super, glad that worked
[09:48:10] RECOVERY - Check the last execution of analytics-dumps-fetch-mediawiki_history_dumps on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-mediawiki_history_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:48:51] \o/
[09:49:18] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10MarcoAurelio) This is a trace from Logstash-Beta I spotted and reported here. I did not insert anything....
[09:52:36] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: hdfs-rsync of mediawiki history dumps fails due to source not present (yet) - https://phabricator.wikimedia.org/T251858 (10elukey) Seems to work! ` Jun 01 09:51:22 labstore1006 kerberos-run-command[12033]: Ignoring missing hdfs source hdfs:///wmf/data/arc...
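A rough sketch of the manual unblocking described above. The application and container IDs are the ones quoted in the conversation; the exact commands are an assumption, since only the outcome is discussed here. Killing a single container lets YARN retry just that task attempt, whereas killing the whole application (the usual, cleaner option mentioned by elukey) would abort the entire 15h job:

    # On the NodeManager hosting the stuck task (e.g. analytics1064),
    # find the JVM of the suspect container for application_1589903254658_45554
    ps -ef | grep container_e11_1589903254658_45554_01_000215

    # Kill only that container; the Spark application keeps running and the
    # task attempt is relaunched elsewhere
    sudo kill <pid-from-the-ps-output>

    # The blunt alternative, which would fail the whole application:
    # yarn application -kill application_1589903254658_45554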
[09:53:07] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: hdfs-rsync of mediawiki history dumps fails due to source not present (yet) - https://phabricator.wikimedia.org/T251858 (10elukey)
[09:58:28] RECOVERY - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:59:52] RECOVERY - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:59:58] good
[10:02:52] RECOVERY - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_delayed on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:07:16] PROBLEM - Check the last execution of check_webrequest_partitions on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:09:49] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) This is the current status for eqiad: ` elukey@ganeti1003:~$ sudo gnt-node list Node DTotal DFree MTotal MNode MFree Pinst Sinst ganeti1001.eqiad.wmnet 707.4G 9....
[10:12:06] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:12:58] yes the check_webrequest_partitions complains since we are behind schedule
[10:13:06] but webrequest is being processed
[10:13:33] (well text is behind schedule, upload already caught up)
[10:40:49] * elukey lunch!
[10:55:01] 10Analytics-Kanban, 10Trash: --- DISCUSSED BELOW --- - https://phabricator.wikimedia.org/T114124 (10Aklapper) @Milimetric: If this is still actively used? If yes then this should really be done via two separate columns instead of this non-obvious error-prone concept.
[11:54:13] Hi elukey - normally I'm off today, but I'm gonna spend some time helping with the errors and all - what a mess !@
[12:39:12] joal: hola! No problem, all good, please don't work today :)
[12:39:16] everything seems under control
[12:39:50] elukey: from me chiming in, issue was created by airflow taking too much mem on an-launcher1001, right?
[12:40:51] joal: initially I thought it was airflow, but it was RU
[12:41:05] really? I'd be interested to know which query
[12:41:05] I mean the bulk of the problem was RU
[12:41:10] ack
[12:41:38] elukey: it looks like CPU usage on an-launcher has been steady for the past 2/3 days at ~50%, which was not the case before - Something must have changed
[12:41:51] that is airflow
[12:42:02] but the issue was memory consumption in this case
[12:42:03] Ah ok :)
[12:42:14] ok :)
[12:42:18] yes yes I noticed that some days ago and forgot to follow up with Marcel
[12:42:41] no problemo, I was trying to make sense of what happens in my head :)
[12:43:05] yes sorry I made a horrible summary probably
[12:43:07] elukey: Now about memory, this is bizarre nonetheless - possibly some hive optimization (local-joins) happening on the worker?
[12:43:23] could be yes
[12:43:37] elukey: I was not complaining about any summary, I was trying to understand :) you fixed, summary will come after :)
[12:44:45] also joal we are using 2PB on hdfs :(
[12:44:54] I've seen that elukey
[12:45:31] elukey: I'm part of the problem (wikitext fixing, so more data) - But something else is happening that I don't know - I'm investigating
[12:45:48] elukey: growing ~70Tb in 1h is not that frequent
[12:46:09] joal: you are on holidays, we can manage, please remove your hands from the keyboard :)
[12:46:23] (you tell me the same when I do it!)
[12:46:47] true
[12:47:43] ok, will drop elukey - I still think investigation is needed about the hdfs used-space growth :)
[12:48:13] joal: yep will do it with mforns when he comes online, I promise :)
[12:48:14] <3
[12:48:43] fyi, I've been trying to process wikitext, unsuccessfully.., maybe this generates some intermediary files?
[12:49:22] djellel: hi - it might be related yes - I don't think it would be intermediate files, as those are not on hdfs
[12:49:31] djellel: your home is quite big, I was wondering if it was intended or not
[12:50:35] djellel: it is around 7TB (that become 21 replicated 3 times)
[12:50:54] elukey, djellel : this is due to .Trash (just checked)
[12:51:25] 6.1 T 18.3 T /user/dedcode/.Trash
[12:51:28] nice L)
[12:51:29] :)
[12:51:33] djellel: can I delete?
[12:51:59] (trash is where files go when you delete them without -skipTrash, and they stay there for a month)
[12:52:09] (prevents accidental oooops)
[12:52:14] ok I get it now
[12:52:23] there is that trash (making 20T)
[12:52:56] Plus /user/hive/warehouse/dedcode.db (df2,enwiki_history_part_year,enwiki_history_part_year2
[12:53:23] That sums to ~75Tb - We have it all
[12:53:28] yes, nuke it :)
[12:53:33] the timestamps match
[12:54:04] ok now we know - Only to decide what to do - Gone for now ;)
[12:54:35] !log /user/dedcode/.Trash/* -skipTrash
[12:54:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:55:46] ok now I see
[12:55:46] 6.1 T 18.3 T /user/hive/warehouse/dedcode.db/df2
[12:55:46] 6.1 T 18.3 T /user/hive/warehouse/dedcode.db/enwiki_history_part_year
[12:55:49] 6.1 T 18.3 T /user/hive/warehouse/dedcode.db/enwiki_history_part_year2
[12:55:52] djellel: --^
[12:56:07] just before leaving djellel - It looks like diffs take a lot more space than text itself - brute force without some heuristic won't make it I think :)
[12:56:22] elukey: same, failed jobs
[12:56:45] joal: don't worry, I am not using the diffmathpatch
[12:57:21] it's a lot smaller, for some reason, processing the whole thing gets to 99.99% and then fails
[12:57:45] djellel: I'd ask you to nuke the dbs that you don't use if possible, dropping the data etc.. it is quicker and you have more context
[12:58:18] joal: lets talk tomorrow if you have time, I feel it's a matter of tuning the job. enjoy your day for now :)
[12:58:42] elukey: I actually ran a drop table from hive :/
[13:00:38] elukey: dfs -rm ?
[13:02:39] djellel: if you dropped the tables from hive yes, you can remove them with -rm -r -skipTrash etc..
[13:02:50] hdfs dfs -rm -r -skipTrash
[13:03:51] checking your db
[13:04:00] also elukey, just checked sqoop - seems all good this month ;)
[13:04:13] joal: \o/
[13:05:42] elukey: done
[13:05:57] ok gone for real - will check later tonight
[13:09:23] djellel: thanks a lot!
[13:18:34] elukey: o/ am filling out quarter estimates in the hw budget sheet
[13:18:46] i filled out a couple but you might have better guesses on some than me
[13:22:31] ottomata: o/ I had a chat with Faidon this morning, perfect timing since he's checking the sheet in these days, the sooner we have a final version the better
[13:23:52] ok great
[13:24:07] lets fill in some guesses real quick then?
[13:24:16] expansion after bigtop
[13:24:26] don't remember exactly when we are trying to do bigtop?
[13:24:51] not before Q2 I think
[13:25:26] so Q2/Q3 could be good
[13:26:01] ok, let's say Q3
[13:26:12] cassandra refresh?
[13:26:55] hello elukey, do you have any thoughts what might have happened to netflow data in this interval? https://w.wiki/SVo
[13:27:13] 31 May, 18:30-19:00 exactly, 0 data points
[13:27:56] I've checked one nfacctd and it was still sending stuff to kafka in this interval
[13:32:04] ottomata: Q2 could be doable
[13:32:29] k!
[13:32:31] thorium replacement?
[13:32:43] cdanis: weird, I am wondering if data is in hive/hdfs
[13:33:37] i see netflow data in hive for hour=29 at least
[13:33:42] ah nice
[13:33:42] and hour=18
[13:34:06] and 17
[13:34:11] cdanis: how do you consume the data?
[13:34:15] not in hive?
[13:34:19] ottomata: for thorium.. Q2?
[13:34:32] k!
[13:34:35] ottomata: mostly from Druid via turnilo
[13:34:41] as far as I know
[13:34:41] druid100[1-3] refresh?
[13:34:42] ah
[13:34:47] is that realtime into druid?
[13:34:50] or via hive?
[13:34:56] both :)
[13:35:02] oh right ok
[13:35:03] hmmm
[13:35:13] so at the very least the hive part should replace if anything is missing
[13:35:16] not sure when that runs
[13:35:22] ahh wait there was a refine error for netflow IIRC
[13:35:59] but that was for the 29th
[13:36:01] nevermind
[13:37:00] ottomata: re druid100[1-3], since we can reimage anytime, I'd say that Q1/Q2 is fine
[13:37:15] we'll have to be careful in moving zookeeper to new nodes
[13:37:23] but should be easy
[13:37:37] but it can wait Q3
[13:37:49] otherwise too many things in Q1/Q2
[13:37:53] ottomata: does it make sense?
[13:38:03] 10Analytics, 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10Dzahn) Hi, gentle prod. Any updates on this? It has been open for a while now.
[13:38:30] cdanis: can you open a task if you have time? Today it has been a bit problematic for us, a lot of things exploding, if this is not urgent I'll try to check tomorrow
[13:42:28] ciao analytics :) i hope you are doing well! i have a question! how often do we update hive tables like mediawiki_imagelinks in wmf_raw? It seems that the last update is 12/2019. Is there a place where I can find a more recently updated table?
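Back on the HDFS clean-up from earlier: a minimal sketch of the kind of commands involved, using the paths quoted above (an illustration of the technique, not a transcript of what was actually run):

    # Per-directory usage: raw size, then size including the 3x replication
    hdfs dfs -du -h /user/dedcode
    hdfs dfs -du -h /user/hive/warehouse/dedcode.db

    # Empty the trash immediately instead of waiting for the 1-month retention
    hdfs dfs -rm -r -skipTrash '/user/dedcode/.Trash/*'

    # If a data directory is still there after DROP TABLE in Hive (as above),
    # remove it explicitly as well
    hdfs dfs -rm -r -skipTrash /user/hive/warehouse/dedcode.db/df2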
[13:50:49] elukey: sure, no urgency, will file a task
[13:52:18] 10Analytics, 10Operations, 10Traffic: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10CDanis)
[14:22:23] heya teamm
[14:25:30] hellOooO
[14:31:29] elukey: cool ya Q3 is good
[14:31:40] how about dedicated superset/turnilo?
[14:31:49] and RAM for Hadoop masters?
[14:32:16] miriam: i don't know about imagelinks, maybe mforns knows or knows who does?
[14:33:35] ottomata: Q2 should be fine, we might need it sooner rather than later if people's usage of superset ramps up dramatically
[14:35:47] 10Analytics, 10Event-Platform, 10Operations, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Ottomata) FYI, we've recently added a 'general.yaml' values support to our helm charts repo. This allows us to render values from puppet. I'd like to accom...
[14:36:31] elukey: and RAM for masters?
[14:36:37] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10akosiaris) >>! In T254125#6181321, @elukey wrote: > This is the current status for eqiad: > > ` > elukey@ganeti1003:~$ sudo gnt-node list > Node DTotal DFree MTotal MNode...
[14:38:34] thanks ottomata!
[14:38:45] 10Analytics, 10Event-Platform, 10Operations, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Pchelolo) cc @hnowlan let's see what @Ottomata gets to and see if we can incorporate it into changeprop
[14:43:25] ottomata: Q1, quick and easy
[14:44:32] great ty!
[14:47:56] 10Analytics, 10Analytics-Kanban, 10Analytics-SWAP, 10Product-Analytics, 10User-Elukey: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (10mpopov) >>! In T247752#6179982, @nshahquinn-wmf wrote: > For some reason, this didn't work for until I put the command...
[14:53:42] 10Analytics, 10Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) ` elukey@ganeti1003:~$ sudo gnt-instance modify -B memory=12g an-launcher1001.eqiad.wmnet Modified instance an-launcher1001.eqiad.wmnet - be/memory -> 12288 Please don't forget that...
[14:54:48] !log stop all timers on an-launcher1001, prep step for reboot
[14:54:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:59:01] ah nope! sqoop is running
[15:19:18] PROBLEM - Check the last execution of reportupdater-pingback on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-pingback https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:22:25] this was me --^
[16:13:20] RECOVERY - Check the last execution of reportupdater-pingback on an-launcher1001 is OK: OK: Status of the systemd unit reportupdater-pingback https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:18:03] 10Analytics, 10Product-Analytics (Kanban): Create Druid tables for Druid datasources in Superset - https://phabricator.wikimedia.org/T251857 (10cchen) 05Open→03Resolved Close the ticket since there's no further updates to this task.
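On the missing netflow data (now tracked as T254161 above): a hedged sketch of how the Hive side of the pipeline could be checked for the gap. The Druid datasource is wmf_netflow; the assumption here is that the refined Hive table is wmf.netflow with the usual year/month/day/hour partitions — adjust the names if they differ:

    # Count refined rows per hour around the 18:30-19:00 gap on 2020-05-31
    hive -e "
      SELECT hour, COUNT(*) AS events
      FROM wmf.netflow
      WHERE year = 2020 AND month = 5 AND day = 31 AND hour BETWEEN 17 AND 19
      GROUP BY hour
      ORDER BY hour;
    "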
[16:18:05] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Experiment with Druid and SqlAlchemy - https://phabricator.wikimedia.org/T249681 (10cchen)
[16:35:35] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ottomata) This is due to a change in {T249261}. I will look into it!
[16:43:17] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ottomata) a:03Ottomata
[16:45:59] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ottomata) Although, I'm not sure what would be trying to reach searchsatisfaction on meta. The logstash-...
[17:03:55] milimetric: is miriam's problem related to the change in field type in the sqooping of imagelinks?
[17:17:57] thanks mforns! millimetric: hi!! Copying the message here, so you don't have to scroll back! Question: how often do we update hive tables like mediawiki_imagelinks in wmf_raw? It seems that the last update is 12/2019. Is there a place where I can find a more recently updated table?
[17:18:15] thanks mforns! milimetric: hi!! Copying the message here, so you don't have to scroll back! Question: how often do we update hive tables like mediawiki_imagelinks in wmf_raw? It seems that the last update is 12/2019. Is there a place where I can find a more recently updated table?
[17:20:51] miriam: I think imagelinks is sqooped on an as-needed basis right now, I did it in December for our data party but we didn’t have anyone asking for regular sqoops. The other tables that are part of the schedule are sqooped monthly.
[17:21:41] so if you need a recent version, I can sqoop one :)
[17:22:05] but not right right now ‘cause all the other big first-of-the-month jobs are running
[17:24:39] ooh I see milimetric, thanks!
[17:25:46] if this is easy to do, we would need this for the intern project, so anytime this week or the next would work, if possible?
[17:26:10] milimetric - is it better if I open a task?
[17:26:22] * elukey off!
[17:26:47] miriam: up to you, sure, do you need it regularly?
[17:27:46] milimetric: for this intern project, a one-off update would be Ok. But I expect in the future we might need a more regular update
[17:28:45] 10Analytics, 10Growth-Team, 10GrowthExperiments, 10Product-Analytics: Homepage: ensure data retention is in line with the guideline exception - https://phabricator.wikimedia.org/T235577 (10nettrom_WMF) 05Open→03Resolved a:03nettrom_WMF This work was done in T243557, closing as resolved. Thanks for th...
[17:28:48] 10Analytics, 10GrowthExperiments, 10Product-Analytics, 10Growth-Team (Current Sprint): Homepage: instrumentation - https://phabricator.wikimedia.org/T216586 (10nettrom_WMF)
[17:30:20] milimetric: so I don't have a sense of the amount of work needed to schedule regular updates. If it isn't, let's do it! Otherwise, a one-off sqooping of a recent version would work for now.
[17:30:46] if *it's not too much work
[17:33:02] both easy, just take up space. I’ll turn on regular updates and I’ll come back to you if we need to scale back later
[17:33:35] fantastic milimetric, many many thanks!!
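To answer the "is there a more recent snapshot?" question directly, the snapshots already registered for the sqooped table can be listed from the Hive metastore; a small sketch (assuming the table is partitioned by snapshot and wiki_db, as the other wmf_raw mediawiki tables are):

    # One line per (snapshot, wiki) partition; the most recent snapshot values
    # are the ones worth querying
    hive -e "SHOW PARTITIONS wmf_raw.mediawiki_imagelinks;"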
[17:33:51] np, easy peasy
[18:08:09] 10Analytics: Decomission notebook hosts - https://phabricator.wikimedia.org/T249752 (10jwang) I have moved my stuffs off the old clients. Thanks.
[18:09:21] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 (10Ottomata) test.wikipeda.org is now successfully sending SearchSatisfaction events via Ev...
[18:11:02] milimetric: image links should be sqooped regularly, i remember adding it to the sqoop job
[18:11:50] nuria yeah I saw that, Marcel’s looking into why it’s not
[18:14:32] milimetric: i see, there is a more recent version on my database on hive as i sqooped it 2020-03
[18:15:04] milimetric: i think miriam can probably use that one
[18:16:48] nuria: milimetric: I was looking into that as ops week, and saw that the sqoop worked, but there are missing partitions in the metastore
[18:17:10] mforns: missing for the past few months?
[18:17:12] I msck repaired the table and it now has data
[18:17:17] mforns: i see
[18:17:28] hm, I see it was added in April by Joseph, to the sqoop list. Something wrong with the load oozie not knowing about it mforns?
[18:17:30] yes, missing for the past few months, but not only that table, but I think all of them
[18:17:56] mforns: i thought that happened as part of the sqooping job (running msck)
[18:18:26] I don't know, looking
[18:31:32] no, it's not on all tables, I was confused, was trying to access data from 2020-05 snapshot, for which the sqoop is still running, so Oozie's mediawiki-history-load (msck) not triggered yet.
[18:32:00] mforns: k, on meeting, let's talk in a bit
[18:32:16] k
[18:44:05] 10Analytics, 10Analytics-Kanban: Table wmf_raw.mediawiki_imagelinks seems to be missing data - https://phabricator.wikimedia.org/T254188 (10mforns)
[18:44:12] 10Analytics, 10Product-Analytics: /srv/published should be structured similarly, have identical README across stat hosts describing said structure - https://phabricator.wikimedia.org/T254189 (10mpopov)
[18:45:02] 10Analytics, 10Analytics-Kanban: Table wmf_raw.mediawiki_imagelinks seems to be missing data - https://phabricator.wikimedia.org/T254188 (10mforns) The mediawiki-history-load oozie workflow has a typo, which skips that particular table. fixing.
[18:45:30] 10Analytics, 10Product-Analytics: /srv/published should be structured similarly, have identical README across stat hosts describing said structure - https://phabricator.wikimedia.org/T254189 (10mpopov)
[18:46:02] (03PS1) 10Mforns: Fix typo that skips imagelinks in mediawiki-history-load [analytics/refinery] - 10https://gerrit.wikimedia.org/r/601387 (https://phabricator.wikimedia.org/T254188)
[19:08:16] 10Analytics, 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10mpopov) I've reached out to @spatton on Slack about this.
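The manual fix mentioned above ("I msck repaired the table") registers partitions that already exist on HDFS but are missing from the Hive metastore — the step the mediawiki-history-load Oozie job normally performs once sqoop finishes. A minimal sketch:

    # Sync metastore partitions with the directories sqoop wrote to HDFS;
    # re-running the SHOW PARTITIONS check from the earlier sketch should then
    # list the new snapshot
    hive -e "MSCK REPAIR TABLE wmf_raw.mediawiki_imagelinks;"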
[19:53:41] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10JAllemandou)
[19:53:52] quick update on imagelink table milimetric and miriam: data is present since last month, but the hive table is not updated - I created T254191 to make sure we don't forget
[19:53:52] T254191: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191
[19:55:55] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10Miriam) Thanks for this @JAllemandou !
[19:58:00] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10JAllemandou) Actually I should have checked before creating this :) The table is already added to the job, and last full available snapshot is `2020-04`...
[19:58:10] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10JAllemandou) 05Open→03Invalid
[20:10:22] (03CR) 10Nuria: [C: 03+2] Fix typo that skips imagelinks in mediawiki-history-load [analytics/refinery] - 10https://gerrit.wikimedia.org/r/601387 (https://phabricator.wikimedia.org/T254188) (owner: 10Mforns)
[20:26:32] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10Miriam) @JAllemandou thanks. However, if I query the mediawiki_imagelinks table in wmf_raw for a random page, I get "2019-12" in the snapshot field, and...
[20:36:27] 10Analytics, 10Analytics-Kanban: Add sqooped imagelinks table to oozie load job for hive to show new snapshots - https://phabricator.wikimedia.org/T254191 (10Miriam) Ignore the message above, I ran the queries again, and it indeed seems that the problem has been solved in the past few hours :) thanks so much!
[20:54:19] 10Analytics, 10Discovery, 10Operations, 10Recommendation-API, 10Patch-For-Review: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10dpifke) a:03dpifke
[20:58:32] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10MarcoAurelio) >>! In T254058#6182454, @Ottomata wrote: > Although, I'm not sure what would be trying to r...
[21:01:04] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Wikimedia-production-error: [beta] EventLogging trying to fetch wrong Schema title - https://phabricator.wikimedia.org/T254058 (10Ottomata) Ah! I see. This is eventlogging-processor trying to parse an event that it shouldn't. https:/...
[21:37:34] (03CR) 10Nuria: [V: 03+2 C: 03+2] Fix typo that skips imagelinks in mediawiki-history-load [analytics/refinery] - 10https://gerrit.wikimedia.org/r/601387 (https://phabricator.wikimedia.org/T254188) (owner: 10Mforns)
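A closing note on the snapshot confusion above: queries against the sqooped tables should pin the snapshot partition explicitly, otherwise a scan can return rows from an older snapshot such as 2019-12. A hedged sketch using the 2020-04 snapshot mentioned above (the il_* column names follow the MediaWiki imagelinks schema, and wiki_db as a partition column is an assumption based on the usual wmf_raw layout):

    # Read from one specific monthly snapshot and one wiki only
    hive -e "
      SELECT il_from, il_to
      FROM wmf_raw.mediawiki_imagelinks
      WHERE snapshot = '2020-04' AND wiki_db = 'enwiki'
      LIMIT 10;
    "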