[00:18:58] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:14] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:57] PROBLEM - SSH on an-launcher1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:22:33] RECOVERY - SSH on an-launcher1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:00:41] PROBLEM - SSH on an-launcher1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:02:13] RECOVERY - SSH on an-launcher1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:08:19] PROBLEM - SSH on an-launcher1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:49] RECOVERY - SSH on an-launcher1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:24:43] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (KCVelaga_WMF) a:ntsako→JAnstee_WMF @JAnstee_WMF thanks for signing off the outputs. I forgot to explicitly ping you on this. Please review and confirm for the [[ https://docs.google.com/spreadshee...
[09:34:44] We had a very high load event on an-launcher1002, lasting a little over an hour, but it has recovered now.
[09:34:46] https://usercontent.irccloud-cdn.com/file/tnN8IMmE/image.png
[09:34:57] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-launcher1002&var-datasource=thanos&var-cluster=analytics
[09:56:59] Data-Engineering, Equity-Landscape: Grants output metrics - https://phabricator.wikimedia.org/T306620 (KCVelaga_WMF) a:KCVelaga_WMF→ntsako @JAnstee_WMF: @ntsako will handle column name changes. Ntsako: assigning back to you as this is signed-off.
[11:00:45] Data-Engineering-Planning: Missconfigured proxies on data-engineering hosts - https://phabricator.wikimedia.org/T326302 (EChetty)
[11:00:49] Data-Engineering-Planning, Patch-For-Review: Update sqoop for CheckUser table - https://phabricator.wikimedia.org/T326330 (EChetty)
[11:00:51] Data-Engineering-Planning, serviceops, Event-Platform Value Stream (Sprint 05): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (EChetty)
[11:00:53] Data-Engineering-Planning, Pageviews-API: Provide a mechanism to notify subscribers when page view data is available - https://phabricator.wikimedia.org/T326229 (EChetty)
[11:00:55] Data-Engineering-Planning: Check home/HDFS leftovers of akhatun - https://phabricator.wikimedia.org/T326157 (EChetty)
[11:00:57] Data-Engineering-Planning, Data Pipelines: Event Platform canary events job occasionally fails to retrieve stream config settings - https://phabricator.wikimedia.org/T326002 (EChetty)
[11:00:59] Data-Engineering-Planning, Product-Analytics: Add TikTok's in-app browser to ua-parser library - https://phabricator.wikimedia.org/T325611 (EChetty)
[11:01:01] Data-Engineering-Planning: Check home/HDFS leftovers of toddleroux / ryanmax / afandian2 - https://phabricator.wikimedia.org/T325527 (EChetty)
[11:01:03] Data-Engineering-Planning, Event-Platform Value Stream: Flink Restart Strategy for Enrichment Service - https://phabricator.wikimedia.org/T325359 (EChetty)
[11:01:05] Data-Engineering-Planning, Event-Platform Value Stream, ci-test-error: Use a fake timer in EventBus unit test for PageChangeEventSerializerTest::testCreatePageChangeVisibilityEvent - https://phabricator.wikimedia.org/T325341 (EChetty)
[11:01:07] Data-Engineering-Planning, Event-Platform Value Stream: Deploy to production k8s - https://phabricator.wikimedia.org/T325307 (EChetty)
[11:01:09] Data-Engineering-Planning: Provide aggregated user device data per-country - https://phabricator.wikimedia.org/T325306 (EChetty)
[11:01:11] Data-Engineering-Planning, Event-Platform Value Stream: Deploy to YARN - https://phabricator.wikimedia.org/T325304 (EChetty)
[11:01:13] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Productionize PyFlink Enrichment Service - https://phabricator.wikimedia.org/T325303 (EChetty)
[11:01:15] Data-Engineering-Planning: Replace refinery-source Guava caches by Caffeine - https://phabricator.wikimedia.org/T325266 (EChetty)
[11:01:17] Data-Engineering-Planning, CirrusSearch, Event-Platform Value Stream, Discovery-Search (Current work): EventRowTypeInfo should support schema evolution of rows seriliazed in flink application state - https://phabricator.wikimedia.org/T325273 (EChetty)
[11:01:19] Analytics-Wikistats, Data-Engineering-Planning: Stats page - https://phabricator.wikimedia.org/T324993 (EChetty)
[11:01:21] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Event Driven Data Pipelines should be generated from a template - https://phabricator.wikimedia.org/T324980 (EChetty)
[11:01:23] Data-Engineering-Planning, Event-Platform Value Stream: [NEEDS GROOMING] Integrate Flink Table API in eventutils-python - https://phabricator.wikimedia.org/T324953 (EChetty)
[11:01:25] Data-Engineering-Planning, Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (EChetty)
[11:01:27] Data-Engineering-Planning, SRE-OnFire, serviceops, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (EChetty)
[11:01:37] Data-Engineering-Planning, Data Pipelines: When moving oozie webrequest-load to airflow/spark avoid the error-check corner case - https://phabricator.wikimedia.org/T324757 (EChetty)
[11:01:41] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): We should provide utilities for local development and unit testing of Python streaming services - https://phabricator.wikimedia.org/T324951 (EChetty)
[11:01:49] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink wrappers and helper libraries should be moved into a dedicated git repo with packaging and CI. - https://phabricator.wikimedia.org/T324746 (EChetty)
[11:01:53] Data-Engineering-Planning, Equity-Landscape: Add country_meta_data - https://phabricator.wikimedia.org/T324681 (EChetty)
[11:01:57] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Streaming and event driven Python services - https://phabricator.wikimedia.org/T324689 (EChetty)
[11:02:01] Data-Engineering-Planning, Epic, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (EChetty)
[11:02:05] Data-Engineering-Planning: Document how to show your work in phabricator and/or elsewhere - https://phabricator.wikimedia.org/T324796 (EChetty)
[11:02:09] Data-Engineering-Planning, Event-Platform Value Stream: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (EChetty)
[11:02:13] Data-Engineering-Planning, SRE-OnFire, serviceops, Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (EChetty)
[11:02:21] Data-Engineering-Planning, serviceops, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (EChetty)
[12:14:00] Data-Engineering-Planning, Data Pipelines, Patch-For-Review: Update sqoop for CheckUser table - https://phabricator.wikimedia.org/T326330 (EChetty)
[12:15:05] Data-Engineering-Planning, Data Pipelines: Provide aggregated user device data per-country - https://phabricator.wikimedia.org/T325306 (EChetty)
[12:15:42] Data-Engineering-Planning, Cassandra, Data Pipelines: Create puppet defined type for adding/updating/deleting secrets or other small files on HDFS - https://phabricator.wikimedia.org/T323692 (EChetty)
[12:29:34] !log roll restarting aqs servers for to bump up mediawiki_history_snapshot to 2022-12
[12:29:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:44:41] Data-Engineering-Planning, Event-Platform Value Stream, ci-test-error: Use a fake timer in EventBus unit test for PageChangeEventSerializerTest::testCreatePageChangeVisibilityEvent - https://phabricator.wikimedia.org/T325341 (Ladsgroup) In case its impact it's not obvious: This is preventing us to me...
[12:44:52] Data-Engineering-Planning, Data Pipelines, Editing-team, WMF-General-or-Unknown, Wikimedia-production-error: "Invalid revision ID -1" error for VisualEditorFeatureUse events, mostly from officewiki - https://phabricator.wikimedia.org/T322602 (EChetty)
[12:46:09] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (EChetty)
[12:48:36] Data-Engineering-Planning, Event-Platform Value Stream: Replace refinery-source Guava caches by Caffeine - https://phabricator.wikimedia.org/T325266 (EChetty)
[12:49:30] Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, serviceops, Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (EChetty)
[12:50:13] Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, serviceops, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (EChetty)
[12:50:47] Data-Engineering, Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (EChetty)
[12:51:05] Data-Engineering, Equity-Landscape: Add country_meta_data - https://phabricator.wikimedia.org/T324681 (EChetty)
[12:51:43] Data-Engineering-Planning, Data Pipelines, Product-Analytics: Add TikTok's in-app browser to ua-parser library - https://phabricator.wikimedia.org/T325611 (EChetty)
[12:54:26] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata-Python triggers a Pandas warning during mariadb.run - https://phabricator.wikimedia.org/T324135 (EChetty)
[12:54:36] Data-Engineering, Product-Analytics, Wmfdata-Python: Remove Matplotlib as a Wmfdata-Python dependency - https://phabricator.wikimedia.org/T324053 (EChetty)
[12:55:13] Analytics-Wikistats, Data-Engineering: Stats page - https://phabricator.wikimedia.org/T324993 (EChetty)
[12:55:31] Analytics, Analytics-Wikistats, Data-Engineering: Anonymous edits - https://phabricator.wikimedia.org/T323562 (EChetty)
[13:44:19] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) Kernel 6.0.12 from backports is no better, unfortunately. {F35982051,width=60%} This has version 42.100.00.00 of...
[13:47:31] Data-Engineering-Planning, Event-Platform Value Stream, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Upgrade the WDQS streaming updater to latest flink (1.15) - https://phabricator.wikimedia.org/T289836 (dcausse)
[14:00:25] (PS2) Xcollazo: Modify refinery-drop-older-than to support 'snapshot' partitions [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614)
[14:28:21] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (MatthewVernon) Yeah, I think the takeaway is "you can (no longer) rely on device names being consistent between reboots"....
[15:19:47] (PS6) Milimetric: [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344 (https://phabricator.wikimedia.org/T322326)
[15:24:59] (CR) CI reject: [V: -1] [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344 (https://phabricator.wikimedia.org/T322326) (owner: Milimetric)
[15:48:30] (CR) Xcollazo: "Re-verified the changes with patch set 2:" [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)
[16:02:20] (CR) Xcollazo: Rematerialize all .json files to ensure consistent ordering of fields in yaml and json files (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/823700 (https://phabricator.wikimedia.org/T308450) (owner: Ottomata)
[16:02:56] (Abandoned) Ottomata: Rematerialize all .json files to ensure consistent ordering of fields in yaml and json files [schemas/event/primary] - https://gerrit.wikimedia.org/r/823700 (https://phabricator.wikimedia.org/T308450) (owner: Ottomata)
[16:03:11] (CR) Ottomata: Rematerialize all .json files to ensure consistent ordering of fields in yaml and json files (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/823700 (https://phabricator.wikimedia.org/T308450) (owner: Ottomata)
[17:50:29] Data-Engineering: Fix anaconda-wmf's setting of REQUESTS_CA_BUNDLE - https://phabricator.wikimedia.org/T306197 (xcollazo) Open→Declined
[17:58:21] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) Thanks @MatthewVernon - I've gone with your suggestion, with the only difference being that it's searching for S...
[18:22:36] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (JAnstee_WMF) @KCVelaga_WMF Yes, I meant to sign off on the inputs from the Inputs QC tab also with the exception of the needed column headers that still need adjusting (pasting below for documentaion here...
[20:34:26] (PS7) Milimetric: [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344 (https://phabricator.wikimedia.org/T322326)
[20:34:36] this one actually works :P ^
[20:34:56] (the build will fail since eventutilities 1.2.2 isn't released yet and I'm using a locally-installed snapshot)
[20:37:14] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (Milimetric)
[20:40:30] (CR) CI reject: [V: -1] [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344 (https://phabricator.wikimedia.org/T322326) (owner: Milimetric)
[20:48:51] Data-Engineering-Planning, Event-Platform Value Stream, ci-test-error: Use a fake timer in EventBus unit test for PageChangeEventSerializerTest::testCreatePageChangeVisibilityEvent - https://phabricator.wikimedia.org/T325341 (Ottomata) Hm. The test here isn't relying on two different timers. This r...
[20:50:59] milimetric: we can release
[20:52:08] ottomata: no rush either way, I'm fine with it failing until you release
[20:52:23] got most of my pipeline to work, gonna look at Flink again on Monday I think