[01:33:29] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[04:58:16] Data-Engineering, DBA, Schema-change-in-production: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130#10898013 (Marostegui)
[05:33:29] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[05:38:27] Quarry: Quarry down - web service unreachable - https://phabricator.wikimedia.org/T395201#10898038 (Liz) Okay, this glitch got fixed late last night but now it's down again! What is the problem here?
[06:00:28] Data-Engineering, DBA, Schema-change-in-production: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130#10898066 (Marostegui)
[07:11:18] Quarry: quarry is leaking tmp files - https://phabricator.wikimedia.org/T395237#10898144 (dcaro) Saw another instance of this today, it was paired with redis being down (shown as 'completed', so the pod did not restart by itself).
[07:13:25] Quarry: Quarry down - web service unreachable - https://phabricator.wikimedia.org/T395201#10898154 (dcaro) >>! In T395201#10898038, @Liz wrote: > Okay, this glitch got fixed late last night but now it's down again! What is the problem here? > > Edit: @taavi, can you look at this? Just manually restarted r...
[07:21:20] Quarry: Quarry down - web service unreachable - https://phabricator.wikimedia.org/T395201#10898170 (Liz) Okay, it's working again. Let's hope this doesn't happen again tomorrow. Thanks!
[07:23:16] Quarry: Quarry down - web service unreachable - https://phabricator.wikimedia.org/T395201#10898172 (taavi) Open→Resolved
[08:53:34] Data-Engineering: AlertLintProblem - https://phabricator.wikimedia.org/T395539#10898507 (phaultfinder)
[09:15:42] Data-Engineering, observability, Observability-Metrics, Event-Platform: Produce MediaWiki client emitted operational metrics into Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T390328#10898656 (fgiunchedi) What comes to mind is Prome...
[09:16:37] Data-Engineering (Q4 2025 April 1st - June 30th), MediaWiki-DomainEvents, Epic, Event-Platform, and 2 others: PageChangeEventSerializer should deprecate WikiPage and adopt PageIdentity - https://phabricator.wikimedia.org/T396453 (gmodena) NEW
[09:23:56] Data-Engineering (Q4 2025 April 1st - June 30th), MediaWiki-DomainEvents, Epic, Event-Platform, and 3 others: Testing the domain event refactoring with production data - https://phabricator.wikimedia.org/T394899#10898693 (gmodena) >>! In T394899#10891114, @daniel wrote: > > Looking at PageChange...
[09:31:46] Data-Engineering, DBA, Schema-change-in-production: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130#10898747 (Marostegui)
[09:33:01] Data-Engineering, DBA, Schema-change-in-production: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130#10898752 (Marostegui)
[09:33:29] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[09:37:56] Data-Engineering (Q4 2025 April 1st - June 30th): Determine how many admins are there in English Wikipedia and French Wikipedia from sub-Saharan Africa - https://phabricator.wikimedia.org/T395279#10898786 (mforns) For the record, this is the query that I used: ` WITH sub_saharan_editors AS ( SELE...
[09:51:08] Data-Engineering, Data-Persistence, Data-Platform, Data-Services, and 3 others: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10898866 (fnegri)
[09:51:40] Data-Engineering, Data-Persistence, Data-Platform, Data-Services, and 3 others: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10898868 (fnegri) clouddb1019 upgraded and repooled.
[10:39:43] !log add an-conf1004 to zookeeper cluster T374922
[10:39:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:39:46] T374922: Bring an-conf100[4-6] into service to replace an-conf100[1-3] - https://phabricator.wikimedia.org/T374922
[11:09:51] !log roll restart analytics zookeeper T374922
[11:09:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:09:54] T374922: Bring an-conf100[4-6] into service to replace an-conf100[1-3] - https://phabricator.wikimedia.org/T374922
[11:52:27] PROBLEM - Zookeeper Server on an-conf1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[12:09:27] RECOVERY - Zookeeper Server on an-conf1004 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[13:17:14] (CR) Gmodena: [C:+1] Update parent POM [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1147771 (https://phabricator.wikimedia.org/T367405) (owner: Peter Fischer)
[13:33:29] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[14:16:58] !log add an-conf1005 to zookeeper cluster T374922
[14:17:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:17:01] T374922: Bring an-conf100[4-6] into service to replace an-conf100[1-3] - https://phabricator.wikimedia.org/T374922
[14:17:17] Data-Engineering (Q4 2025 April 1st - June 30th): Refine to Hive with Airflow – Handle Late-Arrived Events - https://phabricator.wikimedia.org/T370665#10899953 (JAllemandou) I have done a new analysis of late-events using the `wmf.hdfs_usage` table here are the findings and links to the analysis code (snapsh...
[15:05:57] !log roll restart analytics zookeeper to add new host T374922
[15:05:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:05:59] T374922: Bring an-conf100[4-6] into service to replace an-conf100[1-3] - https://phabricator.wikimedia.org/T374922
[15:27:41] Data-Engineering, Data-Platform-SRE (2025.05.24 - 2025.06.13), Event-Platform: Inconsistent state visibility for Flink deployments on dse-k8s-eqiad - https://phabricator.wikimedia.org/T395984#10900329 (Gehel) SRE work on the operator seems to be complete. I'm moving this to done on our side. If more...
[15:39:19] Data-Engineering, Data-Persistence, Data-Platform, Data-Services, and 3 others: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10900430 (Alien333) Note: it looks like the new MariaDB version errors out on some selects involving assignments (`:=`) that previousl...
[15:41:18] Data-Engineering (Q4 2025 April 1st - June 30th): Refine to Hive with Airflow – Handle Late-Arrived Events - https://phabricator.wikimedia.org/T370665#10900439 (JAllemandou) {F62284035}
[16:25:25] Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content, Patch-For-Review: Investigate reasons for remaining inconsistencies - https://phabricator.wikimedia.org/T385112#10900691 (xcollazo) > I still want to do a bigger experiment, perhaps with enwiki, to see if we are good here. Did...
[16:27:49] Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content, Patch-For-Review: Investigate reasons for remaining inconsistencies - https://phabricator.wikimedia.org/T385112#10900701 (xcollazo) ` ssh an-launcher1002.eqiad.wmnet sudo -u analytics bash kerberos-run-command analytics spark3...
[16:34:42] Data-Engineering, DBA, Schema-change-in-production: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130#10900747 (Marostegui)
[16:35:12] Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content, Patch-For-Review: Investigate reasons for remaining inconsistencies - https://phabricator.wikimedia.org/T385112#10900753 (xcollazo) [[ https://airflow.wikimedia.org/dags/table_maintenance_iceberg_weekly/grid?dag_run_id=scheduled...
[17:04:30] Data-Engineering, DBA, Schema-change-in-production: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130#10900941 (Marostegui)
[17:33:10] Data-Engineering, Data-Engineering-Icebox, SRE Observability, Data-Platform-SRE (2025.05.24 - 2025.06.13), Patch-For-Review: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430#10901091 (BTullis)
[17:33:29] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[17:40:11] Data-Engineering: NEW BUG REPORT wmf.nterlanguage_navigation missing mobile data - https://phabricator.wikimedia.org/T396514 (CMyrick-WMF) NEW
[17:54:23] Data-Engineering, Data-Engineering-Radar, AbuseFilter, Trust and Safety Product Team, and 2 others: AbuseFilter abuse_filter_log table: Store IP addresses as hex values - https://phabricator.wikimedia.org/T395612#10901228 (Tchanders)
[18:59:21] Data-Engineering: NEW BUG REPORT wmf.nterlanguage_navigation missing mobile data - https://phabricator.wikimedia.org/T396514#10901457 (Isaac) Thanks @CMyrick-WMF! Some additional context for the Data Platform/Engineering folks: we were doing some analyses related to language-switching on Wikipedia and came a...
[19:00:00] Data-Engineering: NEW BUG REPORT wmf.interlanguage_navigation missing mobile data - https://phabricator.wikimedia.org/T396514#10901462 (Isaac)
[19:13:30] Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content, Patch-For-Review: Investigate reasons for remaining inconsistencies - https://phabricator.wikimedia.org/T385112#10901527 (xcollazo) >>! In T385112#10900753, @xcollazo wrote: > [[ https://airflow.wikimedia.org/dags/table_maintena...
[20:16:13] Data-Engineering, LDAP-Access-Requests, SRE: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10901815 (SKivlehan-WMF) Hello! Thank you all for the assistance on this ticket -- I don't seem to have access to Turnilo, attempting to login return...
[21:33:29] FIRING: [2x] AlertLintProblem: Linting problems found for HaproxyKafkaDeliveryErrors - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[21:34:31] Data-Engineering (Q4 2025 April 1st - June 30th): Unify Airflow's datasets.yaml config files across instances - https://phabricator.wikimedia.org/T383931#10902087 (amastilovic) The deployment was successful, closing the ticket.
[21:34:48] Data-Engineering (Q4 2025 April 1st - June 30th): Unify Airflow's datasets.yaml config files across instances - https://phabricator.wikimedia.org/T383931#10902088 (amastilovic) Open→Resolved