[03:35:20] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:10:26] Data-Engineering, Data-Engineering-Operations: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (TThoabala)
[09:35:23] Quarry, Documentation-Review-Board, Key docs update 2021-22: Quarry docs - https://phabricator.wikimedia.org/T307011 (KBach) Open→In progress
[09:43:41] (PS11) RhinosF1: mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244
[09:48:43] (CR) CI reject: [V: -1] mypy: add to tox [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821244 (owner: RhinosF1)
[09:49:30] Data-Engineering, Data-Engineering-Operations, SRE-Access-Requests: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (Urbanecm) Looks like analytics-privatedata-users request to me. Tagging with #sre-access-requests.
[09:58:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:03:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:38:29] Data-Engineering, Event-Platform Value Stream: [SPIKE][NEEDS GROOMING] Flink enrichment pipline should run on k8 - https://phabricator.wikimedia.org/T315428 (gmodena)
[11:18:09] Data-Engineering-Kanban, Shared-Data-Infrastructure: Determine IP ranges for dse-k8s cluster - https://phabricator.wikimedia.org/T310169 (BTullis)
[11:42:31] Data-Engineering, Shared-Data-Infrastructure, Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (BTullis)
[11:43:04] Data-Engineering, Shared-Data-Infrastructure, Epic: Data Infrastructure as a Service MVP - https://phabricator.wikimedia.org/T308317 (BTullis)
[13:06:36] btullis: Hi :] If it's OK with you I will merge and deploy https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/117 so that the last remaining DAG stops failing at SLA miss.
[13:07:03] mforns: Yes, absolutely fine by me.
[13:07:14] 👍
[13:07:31] Data-Engineering, Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (Urbanecm)
[13:08:52] Data-Engineering-Kanban, Event-Platform Value Stream (Sprint 00), Patch-For-Review: [BUG] jsonschema-tools materializes fields in yaml in a different order than in json files - https://phabricator.wikimedia.org/T308450 (Ottomata) @JAllemandou @Milimetric @phuedx...what do you think about removing the...
[13:12:52] Data-Engineering, Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (Urbanecm) Further investigation shows that @BTullis likely created the views by running `maintain-views` manually (T305280#8016106, T302798#7881457; T303761#7886862 does...
[13:13:19] Data-Engineering-Kanban, Data Engineering Planning, Data Pipelines (Sprint 00), Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (EChetty)
[13:16:22] Data-Engineering, Event-Platform Value Stream (Sprint 00): [SPIKE][NEEDS GROOMING] Flink enrichment pipline should run on k8 - https://phabricator.wikimedia.org/T315428 (gmodena)
[13:19:43] !log deployed airflow for https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/117
[13:19:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:22:04] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (Ottomata) > Are we backfilling both the page state change stream and/or the one with content? I think both > Do...
[14:12:00] Data-Engineering, Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (BTullis) Thanks for the investigation @Urbanecm - You're right that I did run the `maintain-views` manually, without using the cookbook. I clearly didn't know all of the...
[14:13:27] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (xcollazo) >If we can do this with Flink, we should, since then we don't have to maintain 2 codebases that do the s...
[14:17:20] Data-Engineering, Data-Catalog, Product-Analytics: Propagate field descriptions from event schemas to metastore - https://phabricator.wikimedia.org/T307040 (EChetty)
[14:57:48] Data-Engineering-Kanban, Event-Platform Value Stream, Metrics-Platform, Wikidata, and 5 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (mforns) 🙏 🙏 🙏
[14:58:09] Quarry: "Download data -> Excel XLSX" corrupted - https://phabricator.wikimedia.org/T314706 (rook) This seems to work for me in libre office. Could that work as an alternative for you? I lack Microsoft office to test there.
[15:09:58] Data-Engineering, Event-Platform Value Stream (Sprint 00): [SPIKE][NEEDS GROOMING] Assess what is required for the enrichment pipline to run on k8 - https://phabricator.wikimedia.org/T315428 (gmodena)
[15:10:36] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE][NEEDS GROOMING] Assess what is required for the enrichment pipline to run on k8 - https://phabricator.wikimedia.org/T315428 (gmodena)
[15:31:12] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (gmodena) >>! In T314389#8159625, @tchin wrote: Thanks for this @tchin! Do you have a feeling for the lifecycle...
[15:33:00] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (fkaelin) A couple considerations: - the goal is to have this backfill job be executed only a single time. Finding...
[15:39:51] Quarry: "Download data -> Excel XLSX" corrupted - https://phabricator.wikimedia.org/T314706 (Aklapper) No problem loading the file in `libreoffice-calc-7.2.7.2`. Which exact Microsoft Excel version is this about?
[15:51:21] Data-Engineering: Bump spark to 3.3.0 or later - https://phabricator.wikimedia.org/T315454 (Antoine_Quhen)
[15:51:50] Data-Engineering, Data Pipelines: Bump spark to 3.3.0 or later - https://phabricator.wikimedia.org/T315454 (Antoine_Quhen)
[15:52:29] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (Ottomata) Another thought: Since we are talking about only backfilling the 'compacted' (current) page state, we m...
[15:58:24] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (Ottomata) > Would we? stream "generic" code I can think of would mostly be HTTP callbacks to the Action API. If po...
[16:25:13] btullis: Thank you for putting so much time into the hdfs packages! I'm seeing a serious problem which I expect will be a problem for your team too as soon as you start moving to bullseye... https://phabricator.wikimedia.org/T310643#8157223 -- can I drop that back in your lap? You're welcome to use my servers to test on (clouddumps100[12].wikimedia.org)
[16:27:22] andrewbogott: Yes I'll happily look into it, many thanks. Are you OK if I mount and unmount `/mnt/hdfs` while I look into it? These servers aren't quite in production yet, right?
[16:28:56] btullis: that's just fine, but maybe double-check that they're downtimed first :)
[16:29:50] Cool, will do :-)
[16:35:32] thank you again for working on all this. It's turning out to be 100x as much work as I expected
[16:37:41] Analytics-Kanban, Data-Engineering, Event-Platform Value Stream, Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (phuedx)
[16:37:59] Data-Engineering-Kanban, Event-Platform Value Stream, Metrics-Platform, Wikidata, and 5 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (phuedx) Open→Resolved Being **bold**.
[16:39:04] Data-Engineering-Kanban, Event-Platform Value Stream, Metrics-Platform, Wikidata, and 5 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (EChetty) @phuedx Fortune will favour you.
[16:39:47] andrewbogott: Often the way :-) I'll have a look right now.
[16:43:11] andrewbogott: One small issue to begin with:
[16:43:19] https://www.irccloud.com/pastebin/CuFlmJAt/
[16:44:26] Is `/srv/` supposed to be mounted on `/dev/sdb1`?
[16:45:24] LOL we were just talking about this in our standup -- bullseye has the new 'feature' of indeterminate drive labels
[16:45:32] let me double check and see if that's remotely what I expected :)
[16:47:35] clouddumps1002 is set up how we want, so you can switch over to that one while I try to unscramble 1001
[16:47:55] Cool, will do.
[16:48:10] Do you have a lock on /etc/fstab?
[16:48:21] Not any more
[16:53:34] I'm going to reboot clouddumps1001 if that won't mess with you, trying to make sure this mount persists
[16:54:03] Yep, be my guest.
[16:56:37] (PS1) Nmaphophe: Added ArrayAvgUDF to calculate the average two columns by using an array struct. It also ignores nulls [analytics/refinery/source] - https://gerrit.wikimedia.org/r/824242
[17:04:59] 1001 seems to be (at least briefly) in a reasonable state so feel free to go back to messing with that one if you need side-by-side comparisons.
[17:05:04] * andrewbogott -> lunch
[17:33:48] (PS1) Nmaphophe: Added ArrayAvgUDF to calculate the average two columns by using an array struct. It also ignores nulls [analytics/refinery/source] - https://gerrit.wikimedia.org/r/824247
[18:27:26] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (cmooney)
[18:28:41] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (cmooney) Hi @JayCano can you approve this request and confirm (if you are aware) that access needs to be given to shell group "analytics-priv...
[18:29:33] Data-Engineering, Data-Engineering-Operations, SRE, SRE-Access-Requests: Access request to analytics system(s) - https://phabricator.wikimedia.org/T315409 (cmooney) p:Triage→Medium a:cmooney
[18:32:14] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (gmodena) >>! In T314389#8162256, @Ottomata wrote: >> Would we? stream "generic" code I can think of would mostly b...
[18:32:17] Data-Engineering, Event-Platform Value Stream (Sprint 00), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) In the last 2 weeks we had two workshop sessions around this. Notes [[ https://docs.google.c...
[18:33:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:41:10] Data-Engineering, Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (Urbanecm) >>! In T315442#8161895, @BTullis wrote: > Thanks for the investigation @Urbanecm - You're right that I did run the `maintain-views` manually, without using the...
[18:42:35] Data-Engineering, Product-Analytics: PySpark warning messages - https://phabricator.wikimedia.org/T315024 (Mayakp.wiki)
[18:50:20] Data-Engineering, Data Pipelines: Convert to pure Docker the gitlab CI pipeline to build debianized conda - https://phabricator.wikimedia.org/T315475 (Antoine_Quhen)
[18:50:49] Data-Engineering, Data Pipelines: Convert to pure Docker the gitlab CI pipeline to build debianized conda - https://phabricator.wikimedia.org/T315475 (Antoine_Quhen)
[18:50:53] Data-Engineering, Data Pipelines (Sprint 00), Patch-For-Review: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072 (Antoine_Quhen)
[18:51:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:21] Data-Engineering, Data Pipelines: Optimize spark3 conda deb generation - https://phabricator.wikimedia.org/T315478 (Antoine_Quhen)
[19:02:37] Data-Engineering, Data Pipelines: Optimize spark3 conda deb generation - https://phabricator.wikimedia.org/T315478 (Antoine_Quhen)
[19:02:40] Data-Engineering, Data Pipelines (Sprint 00), Patch-For-Review: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072 (Antoine_Quhen)
[19:50:34] Data-Engineering, Event-Platform Value Stream (Sprint 00), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata)
[19:51:55] Data-Engineering, Event-Platform Value Stream (Sprint 00), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata)
[21:10:01] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (tchin) >>! In T314389#8162144, @gmodena wrote: > Do you have a feel for how mature the iceberg connector is? The...
[22:10:08] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (Ottomata) > What are the constraints on throughput and topic retention policy? Would we have to slow event product...
[22:12:17] Data-Engineering, Event-Platform Value Stream (Sprint 00), Spike: [SPIKE] Decide on technical solution for page state stream backfill process - https://phabricator.wikimedia.org/T314389 (Ottomata) > What are the constraints on throughput TBD on Kafka cluster and topic partitions I suppose, but we can...