[01:16:46] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.826% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[02:27:43] (SystemdUnitFailed) resolved: cleanup_tmpdumps.service Failed on dumpsdata1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:16:46] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.604% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:16:57] Data-Engineering (Sprint 6): [Iceberg Migration] Migrate aqs hourly tables to Iceberg - https://phabricator.wikimedia.org/T352669 (tchin) Open→In progress
[06:16:59] Data-Engineering, Epic: [Iceberg Migration] Apache Iceberg Migration - https://phabricator.wikimedia.org/T333013 (tchin)
[06:34:05] (PS1) TChin: [WIP] Add iceberg version of aqs_hourly table [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669)
[08:21:05] * brouberol waves good morning
[09:16:47] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.614% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:20:23] * btullis waves also
[09:21:32] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/22 Add our customisatio...
[09:31:12] Data-Platform-SRE (2023/24 Q2 Milestone 1), Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (dcausse) Load seems to have completed: - wdqs1023: 7.6B triples, load time: 5d,21h - wdqs1024: 7.6B triples, load time: 6d,21h At the gla...
[10:38:01] Data-Platform-SRE (2023/24 Q2 Milestone 1): Create a superset container image using the PipelineLib framework - https://phabricator.wikimedia.org/T352165 (BTullis) Open→Resolved I think we can call this done now. There will likely be some iteration on the image once we start testing it, but for now I...
[10:38:03] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (BTullis)
[10:38:49] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710 (BTullis)
[10:49:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:30:19] noob question, I have a spark file, that writes to a directory via "INSERT OVERWRITE DIRECTORY "${destination_directory}"". I'm trying to test the output, ran it on stat1004 and it's not in the host, which node it's residing?
[12:31:04] Full run output
[12:31:08] https://www.irccloud.com/pastebin/sdtKBx8P/
[12:33:33] Amir1: This may help: https://yarn.wikimedia.org/cluster/app/application_1695896957545_482135
[12:33:58] ah thanks
[12:34:39] You can also run `yarn logs ` with that ApplicationId to get the driver and executor logs in full.
[12:35:53] As in: `yarn logs -applicationId application_1695896957545_482135`
[12:37:04] I can look at your user logs by using the superuser like this: `sudo -u hdfs kerberos-run-command hdfs yarn logs -applicationId application_1695896957545_482135`
[12:37:13] IT was actually in stat1004 /mnt/hdfs
[12:37:20] mounted to hadoop
[12:38:19] You mean you have found your file now? Or are you still looking?
[12:40:01] I mean I found it
[12:40:25] Cool 👍
[13:07:45] Amir1: /mnt/hdfs is just like an NFS mount, so you can access it from any hadoop client (stat boxes, an-launcher, etc) So like `ls /mnt/hdfs/some/path` is the same as `hdfs dfs -ls /some/path`
[13:08:02] yup
[13:08:24] way more brittle though :D
[13:08:37] I want to test something, that's fine :D
[13:14:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:16:47] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.4% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:31:18] (PS1) Ladsgroup: querypage: Set the storage to text to avoid double jsoning [analytics/refinery] - https://gerrit.wikimedia.org/r/983187 (https://phabricator.wikimedia.org/T309738)
[13:31:52] milimetric: A review of this would be appreciated! ^
[13:32:17] oh cool! did that just work then?
[13:32:31] (PS2) Ladsgroup: querypage: Set the storage to text to avoid double jsoning [analytics/refinery] - https://gerrit.wikimedia.org/r/983187 (https://phabricator.wikimedia.org/T309738)
[13:32:33] (CR) Milimetric: [V: +2 C: +2] querypage: Set the storage to text to avoid double jsoning [analytics/refinery] - https://gerrit.wikimedia.org/r/983187 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[13:32:40] yup
[13:33:24] ugh, I can't merge the patch :(
[13:33:55] Amir1: me neither! There's no submit button... wth
[13:34:32] (CR) Ladsgroup: [V: +2 C: +2] querypage: Set the storage to text to avoid double jsoning [analytics/refinery] - https://gerrit.wikimedia.org/r/983187 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[13:34:44] something up with gerrit?
[13:35:08] I think it was because I made the PS2 right after you made the +2s
[13:35:14] I redid it and merged it :P
[13:35:51] now we need to trigger another dag run
[13:38:01] ok, I'll show you how to do it, it's easy, wanna talk?
[13:38:07] https://bit.ly/a-batcave
[14:22:49] Analytics, Data-Engineering-Icebox: Find a strategy to mitigate small-files handling for long-term kept events - https://phabricator.wikimedia.org/T236794 (Ottomata)
[14:23:06] Data-Engineering: Rename event_sanitized to event_longterm - https://phabricator.wikimedia.org/T225751 (Ottomata)
[14:23:08] Data-Engineering, Data Pipelines: [Iceberg] Migrate event_sanitized_iceberg to event_sanitized - https://phabricator.wikimedia.org/T311737 (Ottomata)
[14:23:12] Data-Engineering, Data Pipelines, Epic: [Iceberg] Epic: Icebergify event_sanitized database - https://phabricator.wikimedia.org/T311743 (Ottomata)
[14:26:22] Data-Engineering: Consider renaming event and event_sanitized Hive databases - https://phabricator.wikimedia.org/T225751 (Ottomata)
[14:27:24] Data-Engineering (Sprint 6): [Event Platform] Review analytics switch approach VarnishKafka -> HAProxy - https://phabricator.wikimedia.org/T353454 (Ahoelzl)
[14:30:09] Data-Platform-SRE: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (Gehel) p:Triage→Medium
[14:30:15] Data-Engineering (Sprint 6): [Data Quality] Define concept for Alerting in coordination with SRE - https://phabricator.wikimedia.org/T351093 (Ahoelzl)
[14:34:07] Data-Engineering, CirrusSearch, Discovery-Search (Current work): [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 (Ottomata) We can do this now for mediawiki_page_change, but **doing so will cause events to be emitted to all other streams** (e.g. m...
[14:36:18] Data-Engineering, Observability-Logging, Traffic: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Ottomata) SRE has been working on a [[ https://docs.google.com/document/d/13oZf2aWAUyCtwscAx1PVY3nxDa3QbJRr70BBE3FxdVU/edit#heading=h.jre5lrxox5qi | nice design do...
[14:58:03] Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (mfoss...
[15:08:30] Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 10 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (DAlangi_WMF)
[15:42:27] Data-Engineering, Data Products, Structured-Data-Backlog: DagProperties don't automatically update Airflow variables - https://phabricator.wikimedia.org/T348963 (mfossati) @VirginiaPoundstone, @xcollazo , @mforns : do you think it would be possible to tackle this?
[15:55:55] Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (dcaus...
[16:00:12] Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (dcaus...
[16:04:03] Data-Engineering (Sprint 5), Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (brouberol) Resolved→Open
[16:04:06] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[16:04:09] Data-Engineering (Sprint 5), Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (brouberol) I'm reopening this as we decided in T352838 to re-create new keytabs with principal `spark`. We'll need to: - rem...
[16:10:47] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[17:16:47] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.449% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:13:15] Data-Engineering, Data Pipelines: Refine: Use Spark SQL instead of Hive JDBC - https://phabricator.wikimedia.org/T209453 (Ottomata) This was supposed to be fixed in Spark 3 with the new v2 datasource. I just tried to add a field to a nested column on an Iceberg table via spark3-sql CLI: `lang=sql CREAT...
[18:14:16] Data-Engineering, Data Pipelines: Refine: Use Spark SQL instead of Hive JDBC - https://phabricator.wikimedia.org/T209453 (Ottomata)
[20:23:38] Data-Platform-SRE: Publish Elastic-related packages for Bookworm - https://phabricator.wikimedia.org/T353481 (bking)
[20:40:02] Data-Platform-SRE: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: `wdqs[1009-1010].eqiad.wmnet` - wdqs1009.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager...
[20:42:07] Data-Platform-SRE: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (RKemper)
[20:45:24] Data-Platform-SRE: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (RKemper)
[20:45:38] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking) We've figured out the cause of the issue (thanks @Stevemunene !). The pollers (prometheus blackbox and icinga) do not send an `Accept:...
[21:14:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:16:48] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.274% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:59:38] Data-Engineering, MediaWiki-extensions-EventLogging: Flakey test: EventLoggingTest::testDispatch - https://phabricator.wikimedia.org/T353484 (Jdlrobson)
[22:04:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:40:08] Data-Engineering, MediaWiki-extensions-EventLogging: Flakey test: EventLoggingTest::testDispatch - https://phabricator.wikimedia.org/T353484 (Umherirrender)
[23:40:27] Data-Engineering, MediaWiki-extensions-EventLogging, ci-test-error (WMF-deployed Build Failure): EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (Umherirrender)