[00:27:38] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Support importing a Parquet file into HDFS using wmfdata-python - https://phabricator.wikimedia.org/T273196 (10nshahquinn-wmf) a:05nshahquinn-wmf→03None The [draft pull request](https://github.com/wikimedia/wmfdata-python/pull/25) is still there,... [06:48:03] 10Data-Platform-SRE, 10Discovery-Search (Current work): Reimage WDQS servers to Bullseye - https://phabricator.wikimedia.org/T328325 (10MoritzMuehlenhoff) But there are still 23 more servers running Buster; are these meant to be phased out/decommissioned? ` jmm@cumin2002:~$ sudo cumin A:wdqs-all 'cat /e... [07:36:40] (03CR) 10Peter Fischer: [C: 03+1] Add mediawiki/cirrussearch/page-rerender [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [08:05:42] (SystemdUnitFailed) firing: ifup@ens13.service Failed on schema2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:09:44] (EventLoggingKafkaLag) firing: Kafka consumer lag for event logging over threshold for past 15 min. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_processor_client_side_00 - https://alerts.wikimedia.org/?q=alertname%3DEventLoggingKafkaLag [08:10:42] (SystemdUnitFailed) resolved: ifup@ens13.service Failed on schema2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:08] I'm looking into the `EventLoggingKafkaLag` alert above. This is the first time I've come across it. 
The best reference seems to be this: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/EventLogging/Administration#Consumption_Lag_%22Alarms%22:_Burrow [08:40:52] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) [08:42:09] I have created the following ticket about it: https://phabricator.wikimedia.org/T341551 - I'm not even familiar enough with it to be able to triage it properly at the moment, so any input from those with experience of it would be welcome. [08:46:27] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) Maybe it's something to do with kafkamon1003 and a failed burrow service. {F37135755,width=60%} https://i... [08:53:33] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) I'm going to restart this `burrow-jumbo-eqiad` service on kafkamon1003. ` btullis@kafkamon1003:~$ systemc... [08:54:09] !log `systemctl start burrow-jumbo-eqiad.service` on kafkamon1003 for T341551 [08:54:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:54:13] T341551: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 [08:56:21] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) OK, the restart of that service failed. Lots of errors, but this one stood out. ` Jul 11 08:53:47 kafkam... 
[08:59:34] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10ops-monitoring-bot) Host rebooted by btullis@cumin1001 with reason: Rebooting to troubleshoot errors restarting bu... [08:59:37] !log rebooting kafkamon1003 [08:59:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:00:17] Thank you Btullis. [09:01:14] A pleasure aqu. This felt like something that would be a good thing for me to look into. [09:04:44] (EventLoggingKafkaLag) resolved: Kafka consumer lag for event logging over threshold for past 15 min. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_processor_client_side_00 - https://alerts.wikimedia.org/?q=alertname%3DEventLoggingKafkaLag [09:07:05] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10jcrespo) @Ladsgroup The alert is still going on (more recently 2 hours ago)- if you confirm data has all been correctly sanitized, there could be a dat... [09:08:38] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10jcrespo) @BTullis Please don't proceed until this is clarified. [09:13:15] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10BTullis) >>! In T338678#9004315, @jcrespo wrote: > @BTullis Please don't proceed until this is clarified. Acknowledged. I'll await verification that t... 
[09:17:27] (03PS1) 10Jennifer Ebe: T340880 Merge visibility changes into hourly target table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937047 [09:18:18] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) p:05Triage→03High A fresh boot didn't fix the problem with the service starting. When burrow starts t... [09:18:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:49] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10Ladsgroup) I'm 100% sure I ran the clean up in both eqiad and codfw sanitariums and result of `check_private_data.py -S /run/mysqld/mysqld.s5.sock` was... [09:30:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:04] elukey: Have you got much experience of burrow? Re an alert: T341551 [09:31:05] T341551: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 [09:32:02] btullis: o/ some, it never really caused issues in the past [09:32:04] checking [09:32:26] Thanks. 
[09:33:41] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10Ladsgroup) >>! In T338678#9004409, @Ladsgroup wrote: > I'm 100% sure I ran the clean up in both eqiad and codfw sanitariums and result of `check_privat... [09:33:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:20] wow very weird [09:35:26] I see a ton of [09:35:27] Jul 11 09:34:50 kafkamon1003 Burrow[2595]: 2023/07/11 09:34:50 http: Accept error: accept tcp [::]:8100: accept4: too many open files; retrying in 40ms [09:35:32] in all the burrow units [09:35:52] Ah, I hadn't checked the other units. Is this after the reboot as well? [09:36:23] yes I think so [09:36:54] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10Ladsgroup) I have to debug how the whole thing is working. Give me a bit. [09:37:44] Maybe there is something we can do with a `ulimit` or similar? [09:38:21] there are ~2k connections in TIME_WAIT for that port, all localhost -> localhost, but we should have space in theory [09:39:45] trying to restart burrow-main [09:39:56] yeah errors again [09:41:09] has anything changed recently with the VM etc..? [09:41:31] Not as far as I know. It had 68 days' uptime before I restarted it. [09:42:46] elukey, btullis: Could it be related to the kafka certs changes done lately? 
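[Editor's note] The `accept4: too many open files` error above means the Burrow process exhausted its `RLIMIT_NOFILE` file-descriptor limit, which is what `ulimit -n` (or systemd's `LimitNOFILE=`) controls. A minimal Python sketch for inspecting the current process limit; the helper name is illustrative, not from any tooling in this log:

```python
# Sketch: inspect the per-process open-file limit behind
# "accept4: too many open files". Helper name is illustrative only.
import resource

def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The soft limit is what a process actually hits; systemd's LimitNOFILE=
    # raises it per unit, up to the hard limit cap.
    print(f"open-file limit: soft={soft} hard={hard}")
```

Comparing this number against the count of sockets in `TIME_WAIT` mentioned above (~2k) is one quick way to see whether the limit is plausible as the culprit.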
[09:43:21] codfw doesn't show this issue afaics [09:43:34] joal: o/ in theory no, burrow uses plaintext ports afaics [09:43:42] ack [09:45:29] the jumbo burrow instance goes oom when starting its kafka client [09:47:08] I added a topic to kafka-jumbo yesterday by the name of `_schemas` [09:48:13] btullis: I checked https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var&var-datasource=eqiad%20prometheus%2Fops&var-consumer_group=eventlogging_processor_client_side_00 and there was a gigantic jump in offset increment for eventlogging today [09:48:35] could it be that burrow tries to load it, for some reason, failing? [09:50:58] Yes, that seems very likely. I hadn't spotted that the second graph is a rate, so it's a big jump in offset. [09:53:19] I don't recall exactly the source topic for the eventlogging's processors, maybe there was a jump in traffic? [09:53:23] sounds weird though [09:54:51] elukey: seen OOM in kafka-clients when mixing up PLAINTEXT and TLS connection [09:56:01] elukey: I'm going to have to come back to this later. I'm getting towards my maintenance window for datahub, so I need to focus on that for a while. [09:56:42] sure [09:56:53] dcausse: ack thanks, in theory from the configs burrow uses plain text [09:58:15] kk, saw this in situation like this: https://issues.apache.org/jira/browse/KAFKA-4090 [10:02:15] thanks! 
[10:02:17] this one is very weird [10:07:40] btullis: I added LimitNOFILE=8192 to the burrow main instance and it seems to have worked with the open files error [10:08:44] so we'll probably file a separate change for it, jumbo still doesn't work [10:10:30] no big jumps in traffic for eventlogging-client-side, the increase in offset committed for the eventlogging consumer group is weird [10:11:01] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10fnegri) [10:13:30] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) I've taken an on-disk backup of each of the datahubsearch nodes in sequence by doing the following: ` sudo depool sudo systemctl stop opensearch_1@datahub.service sudo tar czf ~/datahubsea... [10:16:11] going afk, I need to run some errands, I disabled puppet on 1003 to leave the fixes there [10:16:45] elukey: Ack, thanks. [10:17:13] ah wait now it seems to work [10:17:39] I excluded the eventlogging consumer groups in the burrow jumbo's config [10:17:46] and I don't see the OOMs anymore [10:17:55] it is definitely that then [10:17:56] btullis: --^ [10:18:18] ah no sorry wrong one sigh [10:18:27] nevermind, same error [10:18:38] ttl! [10:20:16] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) I created a backup of the production datahub database with the following command on db1108 ` btullis@db1108:~$ sudo mysqldump -S /var/run/mysqld/mysqld.analytics_meta.sock --single-transa... [10:42:13] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) I'm still going to have to configure kafka manually, since we have errors from the kafka-setup job. 
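[Editor's note] A `LimitNOFILE=8192` change like the one applied at [10:07:40] is normally made permanent with a systemd drop-in rather than by editing the unit file. The sketch below follows the standard drop-in convention; the path and unit name are assumptions, not taken from the actual Puppet change:

```ini
# /etc/systemd/system/burrow-main.service.d/override.conf  (illustrative path)
[Service]
# Raise the file-descriptor limit hit by "accept4: too many open files"
LimitNOFILE=8192
```

After writing the drop-in, `systemctl daemon-reload` and a restart of the unit are needed for it to take effect.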
The list of topics is: ` MetadataAuditEvent_v4 MetadataChangeEvent_v4 FailedMetadataCha... [10:53:21] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) I have created the missing topic with: ` btullis@kafka-jumbo1001:~$ kafka topics --create --if-not-exists --partitions 1 --replication-factor 3 --config retention.ms=-1 --topic DataHubUpg... [10:58:55] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) Also, according to [[https://github.com/datahub-project/datahub/blob/v0.10.4/docker/kafka-setup/kafka-setup.sh#L117|the kafka-setup.sh script]] there is no reason why `PlatformEvent_v1` s... [11:00:09] !log Proceeding to upgrade datahub in production [11:00:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:56:14] 10Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (10diego) Hi @MoritzMuehlenhoff , Yes, please I'll need a copy of all the data both on the stat machines and HDFS Thanks! [12:00:50] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) The upgrade has gone well, I think. The only thing is that it looks like the sample data I ingested into the staging instance yesterday ended up in production too. {F37135853,width=60%} I... 
[12:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:42] still trying to fix the burrow jumbo exporter [13:40:15] 10Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (10MoritzMuehlenhoff) >>! In T340427#9004966, @diego wrote: > Hi @MoritzMuehlenhoff , > Yes, please I'll need a copy of all the data both on the stat machines and HDFS > > Thanks! The data is processed by Data... 
[13:40:33] 10Data-Platform-SRE: null shown in the user profile dropdown in datahub - https://phabricator.wikimedia.org/T327969 (10BTullis) This has now been resolved as part of {T329514} {F37135939,width=70%} [13:40:43] 10Data-Platform-SRE: null shown in the user profile dropdown in datahub - https://phabricator.wikimedia.org/T327969 (10BTullis) 05Open→03Resolved [13:51:05] 10Data-Engineering, 10Data-Platform-SRE, 10observability, 10Patch-For-Review: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10elukey) [13:56:08] 10Quarry, 10Cloud-VPS, 10Toolforge, 10WMF-Legal: Potential ambiguities in the Labs Terms of Use - https://phabricator.wikimedia.org/T140486 (10fnegri) [14:10:00] (03CR) 10D3r1ck01: "@note: node10 or node12 is not supported in CI. So this patch will always fail and would need manual testing." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/937073 (https://phabricator.wikimedia.org/T341038) (owner: 10D3r1ck01) [14:20:54] 10Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (10diego) I have access to most of the data, I can wait a couple of weeks to get the full dump. [14:29:00] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) This looks better. It's deleting the existing indices and then issuing an MAE for each aspect. ` 2023-07-11 14:27:18,884 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Starting upgr... [14:33:54] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) Success! The cleanup job has successfully removed all of the errant data from elasticsearch and rebuilt the indices. 
{F37136000,width=70%} [14:36:59] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:39:27] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:48:52] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:49:38] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:51:55] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:51:58] (03PS1) 10Btullis: Update the datahub packaged environment to v0.10.4 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937137 (https://phabricator.wikimedia.org/T329514) [14:58:17] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) @Milimetric - I could do with your help to [[https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub/Upgrading#Deploy_datahub_CLI_tool|update the conda environment]] for the... [15:09:44] (EventLoggingKafkaLag) firing: Kafka consumer lag for event logging over threshold for past 15 min. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_processor_client_side_00 - https://alerts.wikimedia.org/?q=alertname%3DEventLoggingKafkaLag [15:24:58] btullis: did you do anything? 
[15:25:32] weird metrics re-appeared, but burrow is still erroring [15:29:10] ok I may have found a solution [15:29:18] https://github.com/linkedin/Burrow/wiki/Consumer-Kafka [15:29:31] I've set start-latest=true [15:29:44] (EventLoggingKafkaLag) resolved: Kafka consumer lag for event logging over threshold for past 15 min. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_processor_client_side_00 - https://alerts.wikimedia.org/?q=alertname%3DEventLoggingKafkaLag [15:29:45] Oh great. [15:30:09] so it bypassed the big thing that caused the issue, I think [15:31:16] let's see if re-running it with old settings works as well [15:31:55] nope, it fails [15:34:42] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on dse-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:36:38] btullis: created https://gerrit.wikimedia.org/r/c/operations/puppet/+/937144, I think it is the best compromise [15:36:41] lemme know your thoughts [15:39:42] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on dse-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:07] elukey: Do you think that with this it would be skipping events? Is there a data loss issue, or am I misunderstanding? 
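[Editor's note] On the "would it be skipping events?" question above: Burrow derives consumer lag from the group's committed offsets versus the broker's log-end offsets, and `start-latest=true` only changes where Burrow begins reading the `__consumer_offsets` commit log; the consumers themselves are untouched. A toy sketch of that lag calculation (function name and numbers are mine, purely illustrative):

```python
# Toy sketch of the per-partition lag a tool like Burrow evaluates:
#   lag = broker log-end offset - consumer group's last committed offset
# start-latest=true only moves where Burrow starts reading the commit log;
# the consumers' own positions are unaffected. All values are illustrative.

def consumer_lag(log_end_offsets, committed_offsets):
    """Return {partition: lag}, clamped at zero."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }

if __name__ == "__main__":
    lags = consumer_lag({0: 1200, 1: 950}, {0: 1150, 1: 950})
    print(lags)  # {0: 50, 1: 0}
```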
[15:45:07] 10Data-Engineering, 10Data-Platform-SRE, 10observability, 10Patch-For-Review: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10elukey) 05Open→03Resolved [15:45:22] I'm still struggling to get my head around it. [15:45:40] 10Data-Engineering, 10Data-Platform-SRE, 10observability, 10Patch-For-Review: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10elukey) To keep archives happy - I think that burrow tried to pull a huge... [15:46:06] btullis: burrow reads the commit logs for consumer groups, so yeah starting from latest would skip some data [15:46:27] but in theory we don't really care, as long as we have a good picture of what's happening after it gets restarted [15:46:42] not entirely sure why the eventlogging's commit log went that big [15:46:47] OK, got it. Thanks so much for all of your help. [15:46:57] <3 [15:47:57] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) [15:50:32] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10MoritzMuehlenhoff) >>! In T336286#8990904, @BTullis wrote: >>>! In T336286#8990871, @xcollazo wrote: >>>>! In T336286#8990318, @BTullis wrote: >>> I made a patch to try the upgrade to version... [15:51:15] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) I'm running the `publish-debian-package` pipeline manually on https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/pipelines/21904 When it's complete I will upload th... [15:53:29] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) >>! 
In T336286#9005977, @MoritzMuehlenhoff wrote: > > There was a 2.6.3 release, which fixes five additional issues (all marked "low" by upstream, though): Oh, these Airflow releas... [15:58:16] milimetric: If you have any time to help out with this, I'd be grateful: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/937137/ I've not built a packaged environment like this before. [15:59:12] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [15:59:19] btullis: oh yeah, sure, you mean building and publishing it to archiva? [15:59:34] you got it, I'll do it and then merge your change when it's done [15:59:40] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [16:01:11] milimetric: Thanks so much. I could probably muddle through, but I was confused over my own local conda environment vs pip etc. As you wrote the README, I thought I'd ask for help. [16:05:01] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) Note that the edit to maintain the floating point number decimal places is achieved by passing the `floatfmt` kwarg to [[ https://gi... 
[16:05:19] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [16:07:59] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [16:16:49] btullis: https://archiva.wikimedia.org/#artifact/datahub/cli/0.10.4 [16:16:54] (merging code now) [16:17:11] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "packaged and deployed to https://archiva.wikimedia.org/#artifact/datahub/cli/0.10.4" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937137 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [16:30:47] milimetric: <3 Many thanks. [16:55:00] (03CR) 10DLynch: [C: 03+1] Fix editattemptstep ref [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: 10Kimberly Sarabia) [17:23:23] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10MoritzMuehlenhoff) [19:59:25] 10Data-Platform-SRE: Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (10BTullis) This is still not resolved by the recent upgrade to DataHub 0.10.4, but we can now press ahead with the plan to switch DataHub authentication to OIDC in T305874 [20:01:11] 10Data-Engineering, 10Epic: Data Catalog MVP - https://phabricator.wikimedia.org/T299910 (10BTullis) [20:01:13] 10Data-Platform-SRE, 10Data-Catalog: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (10BTullis) 05Open→03Resolved [20:02:50] 10Data-Platform-SRE, 10Data-Catalog: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (10BTullis) I've created this ticket {T341194} for migrating the DataHub build pipeline to GitLab. 
[20:04:20] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) p:05Medium→03High Raising the priority of this task. [20:12:37] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) a:05Stevemunene→03BTullis [20:14:37] 10Data-Engineering, 10Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (10BTullis) [20:14:39] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines: Integrate Airflow with DataHub - https://phabricator.wikimedia.org/T306977 (10BTullis) [20:27:59] 10Data-Engineering, 10Data-Catalog: Ingest feature Hive schema into datahub - https://phabricator.wikimedia.org/T326598 (10BTullis) @odimitrijevic - Can this ticket be resolved now? We have metadata about the hive tables: [[https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf.web... [20:32:56] 10Data-Engineering, 10Data-Catalog: Re-enable Public Druid metadata ingestion - https://phabricator.wikimedia.org/T311547 (10BTullis) It seems like this might be a good use-case for the DataHub Actions Framework: https://datahubproject.io/docs/act-on-metadata/ Currently it seems that we run the druid inde... [20:36:00] 10Data-Engineering, 10Data-Catalog: Establish a Business Glossary - https://phabricator.wikimedia.org/T311524 (10BTullis) 05Open→03Resolved a:03BTullis I think it's fair to say that this is done. https://datahub.wikimedia.org/glossary {F37136335,width=70%} [21:04:04] 10Analytics-Radar, 10Data-Engineering: stat1005: failing systemd job - https://phabricator.wikimedia.org/T330671 (10BTullis) 05Open→03Resolved We can close this ticket. 
Failed user jupyterhub servers caused some noise, but we have since mitigated that with : {T336951} [21:05:20] 10Analytics, 10Data-Engineering-Icebox: jmx_presto prometheus job down for some an-presto hosts - https://phabricator.wikimedia.org/T327753 (10BTullis) 05Open→03Resolved a:03BTullis This is fixed. [21:07:47] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Requesting permission to use kafka-main to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10bking) [21:09:06] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10bking) [21:12:58] 10Analytics, 10Data-Engineering-Icebox, 10Dumps-Generation, 10cloud-services-team: analytics-dumps-fetch-unique_devices.service failing on dumps servers - https://phabricator.wikimedia.org/T318849 (10BTullis) 05Open→03Resolved a:03BTullis I believe that this can be resolved. [21:13:48] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Mail: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (10BTullis) [21:21:19] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Discussion of Event Driven Systems - https://phabricator.wikimedia.org/T290203 (10BTullis) Should we move these links and thoughts to wikitech, or should we leave this ticket open? [21:21:48] 10Data-Engineering-Icebox, 10Product-Analytics, 10Editing-team (Tracking): How often do people try to edit on mobile devices, using the desktop site, at the English Wikipedia? - https://phabricator.wikimedia.org/T288972 (10BTullis) [21:24:45] 10Data-Engineering-Icebox, 10Pageviews-Anomaly, 10Product-Analytics: Analyse possible bot traffic for ptwiki article Ambev - https://phabricator.wikimedia.org/T282502 (10BTullis) Should we close this ticket? We still have the wider topic ticket open? 
{T138207} [21:29:01] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Discussion of Event Driven Systems - https://phabricator.wikimedia.org/T290203 (10odimitrijevic) 05Open→03Resolved a:03odimitrijevic [21:29:39] 10Data-Engineering, 10Data-Catalog: Ingest feature Hive schema into datahub - https://phabricator.wikimedia.org/T326598 (10odimitrijevic) 05Open→03Resolved a:03odimitrijevic [21:30:31] 10Data-Engineering-Icebox, 10Data-Platform-SRE: Upgrade to Kafka MirrorMaker 2 - https://phabricator.wikimedia.org/T277467 (10BTullis) [21:40:19] 10Analytics-Radar, 10Data-Engineering-Icebox, 10Unplanned-Sprint-Work: Filter out bot traffic for all of our metrics - https://phabricator.wikimedia.org/T276308 (10BTullis) Can this ticket be closed now, or is there still more work to be done? [21:45:19] 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, 10Shared-Data-Infrastructure (2022-23 Q4 Wrap up): Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10odimitrijevic) @BTullis do the permissions need to be removed before cl... [22:01:10] 10Data-Engineering, 10AQS2.0, 10PageViewInfo, 10API Platform (AQS 2.0 Roadmap): MediaWiki frequently receives HTTP 500 from AQS (via PageViewInfo extension) - https://phabricator.wikimedia.org/T341634 (10Krinkle)