[00:27:38] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Support importing a Parquet file into HDFS using wmfdata-python - https://phabricator.wikimedia.org/T273196 (10nshahquinn-wmf) a:05nshahquinn-wmf→03None The [draft pull request](https://github.com/wikimedia/wmfdata-python/pull/25) is still there,... [06:48:03] 10Data-Platform-SRE, 10Discovery-Search (Current work): Reimage WDQS servers to Bullseye - https://phabricator.wikimedia.org/T328325 (10MoritzMuehlenhoff) But there are still 23 more servers running Buster; are these meant to be phased out/decommissioned? ` jmm@cumin2002:~$ sudo cumin A:wdqs-all 'cat /e... [07:36:40] (03CR) 10Peter Fischer: [C: 03+1] Add mediawiki/cirrussearch/page-rerender [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [08:05:42] (SystemdUnitFailed) firing: ifup@ens13.service Failed on schema2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:09:44] (EventLoggingKafkaLag) firing: Kafka consumer lag for event logging over threshold for past 15 min. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_processor_client_side_00 - https://alerts.wikimedia.org/?q=alertname%3DEventLoggingKafkaLag [08:10:42] (SystemdUnitFailed) resolved: ifup@ens13.service Failed on schema2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:08] I'm looking into the `EventLoggingKafkaLag` alert above. This is the first time I've come across it. 
The best reference seems to be this: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/EventLogging/Administration#Consumption_Lag_%22Alarms%22:_Burrow [08:40:52] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) [08:42:09] I have created the following ticket about it: https://phabricator.wikimedia.org/T341551 - I'm not even familiar enough with it to be able to triage it properly at the moment, so any input from those with experience of it would be welcome. [08:46:27] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) Maybe it's something to do with kafkamon1003 and a failed burrow service. {F37135755,width=60%} https://i... [08:53:33] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) I'm going to restart this `burrow-jumbo-eqiad` service on kafkamon1003. ` btullis@kafkamon1003:~$ systemc... [08:54:09] !log `systemctl start burrow-jumbo-eqiad.service` on kafkamon1003 for T341551 [08:54:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:54:13] T341551: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 [08:56:21] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) OK, the restart of that service failed. Lots of errors, but this one stood out. ` Jul 11 08:53:47 kafkam... 
[08:59:34] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10ops-monitoring-bot) Host rebooted by btullis@cumin1001 with reason: Rebooting to troubleshoot errors restarting bu... [08:59:37] !log rebooting kafkamon1003 [08:59:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:00:17] Thank you Btullis. [09:01:14] A pleasure aqu. This felt like something that would be a good thing for me to look into. [09:04:44] (EventLoggingKafkaLag) resolved: Kafka consumer lag for event logging over threshold for past 15 min. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_processor_client_side_00 - https://alerts.wikimedia.org/?q=alertname%3DEventLoggingKafkaLag [09:07:05] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10jcrespo) @Ladsgroup The alert is still going on (more recently 2 hours ago)- if you confirm data has all been correctly sanitized, there could be a dat... [09:08:38] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10jcrespo) @BTullis Please don't proceed until this is clarified. [09:13:15] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10BTullis) >>! In T338678#9004315, @jcrespo wrote: > @BTullis Please don't proceed until this is clarified. Acknowledged. I'll await verification that t... 
[09:17:27] (03PS1) 10Jennifer Ebe: T340880 Merge visibility changes into hourly target table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937047 [09:18:18] 10Data-Engineering, 10Data-Platform-SRE: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10BTullis) p:05Triage→03High A fresh boot didn't fix the problem with the service starting. When burrow starts t... [09:18:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:49] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10Ladsgroup) I'm 100% sure I ran the clean up in both eqiad and codfw sanitariums and result of `check_private_data.py -S /run/mysqld/mysqld.s5.sock` was... [09:30:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:04] elukey: Have you got much experience of burrow? Re an alert: T341551 [09:31:05] T341551: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 [09:32:02] btullis: o/ some, it never really caused issues in the past [09:32:04] checking [09:32:26] Thanks. 
[09:33:41] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10Ladsgroup) >>! In T338678#9004409, @Ladsgroup wrote: > I'm 100% sure I ran the clean up in both eqiad and codfw sanitariums and result of `check_privat... [09:33:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:20] wow very weird [09:35:26] I see a ton of [09:35:27] Jul 11 09:34:50 kafkamon1003 Burrow[2595]: 2023/07/11 09:34:50 http: Accept error: accept tcp [::]:8100: accept4: too many open files; retrying in 40ms [09:35:32] in all the burrow units [09:35:52] Ah, I hadn't checked the other units. Is this after the reboot as well? [09:36:23] yes I think so [09:36:54] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10Ladsgroup) I have to debug how the whole thing is working. Give me a bit. [09:37:44] Maybe there is something we can do with a `ulimit` or similar? [09:38:21] there are ~2k connections in TIME_WAIT for that port, all localhost -> localhost, but we should have space in theory [09:39:45] trying to restart burrow-main [09:39:56] yeah errors again [09:41:09] has anything changed recently with the VM etc..? [09:41:31] Not as far as I know. It had 68 days' uptime before I restarted it. [09:42:46] elukey, btullis: Could it be related to the kafka certs changes done lately? 
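[Editor's note] The `accept4: too many open files` error above means the Burrow process exhausted its `RLIMIT_NOFILE` file-descriptor limit, which is what `ulimit -n` (or systemd's `LimitNOFILE=`) controls. A minimal Python sketch for inspecting the current process limit; the helper name is illustrative, not from any tooling in this log:

```python
# Sketch: inspect the per-process open-file limit behind
# "accept4: too many open files". Helper name is illustrative only.
import resource

def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The soft limit is what a process actually hits; systemd's LimitNOFILE=
    # raises it per unit, up to the hard limit cap.
    print(f"open-file limit: soft={soft} hard={hard}")
```

Comparing this number against the count of sockets in `TIME_WAIT` mentioned above (~2k) is one quick way to see whether the limit is plausible as the culprit.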
[09:43:21] codfw doesn't show this issue afaics [09:43:34] joal: o/ in theory no, burrow uses plaintext ports afaics [09:43:42] ack [09:45:29] the jumbo burrow instance goes oom when starting its kafka client [09:47:08] I added a topic to kafka-jumbo yesterday by the name of `_schemas` [09:48:13] btullis: I checked https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var&var-datasource=eqiad%20prometheus%2Fops&var-consumer_group=eventlogging_processor_client_side_00 and there was a gigantic jump in offset increment for eventlogging today [09:48:35] could it be that burrow tries to load it, for some reason, failing? [09:50:58] Yes, that seems very likely. I hadn't spotted that the second graph is a rate, so it's a big jump in offset. [09:53:19] I don't recall exactly the source topic for the eventlogging's processors, maybe there was a jump in traffic? [09:53:23] sounds weird though [09:54:51] elukey: seen OOM in kafka-clients when mixing up PLAINTEXT and TLS connection [09:56:01] elukey: I'm going to have to come back to this later. I'm getting towards my maintenance window for datahub, so I need to focus on that for a while. [09:56:42] sure [09:56:53] dcausse: ack thanks, in theory from the configs burrow uses plain text [09:58:15] kk, saw this in situation like this: https://issues.apache.org/jira/browse/KAFKA-4090 [10:02:15] thanks! 
[10:02:17] this one is very weird [10:07:40] btullis: I added LimitNOFILE=8192 to the burrow main instance and it seems to have worked with the open files error [10:08:44] so we'll probably file a separate change for it, jumbo still doesn't work [10:10:30] no big jumps in traffic for eventlogging-client-side, the increase in offset committed for the eventlogging consumer group is weird [10:11:01] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10fnegri) [10:13:30] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) I've taken an on-disk backup of each of the datahubsearch nodes in sequence by doing the following: ` sudo depool sudo systemctl stop opensearch_1@datahub.service sudo tar czf ~/datahubsea... [10:16:11] going afk, I need to run some errands, I disabled puppet on 1003 to leave the fixes there [10:16:45] elukey: Ack, thanks. [10:17:13] ah wait now it seems to work [10:17:39] I excluded the eventlogging consumer groups in the burrow jumbo's config [10:17:46] and I don't see the OOMs anymore [10:17:55] it is definitely that then [10:17:56] btullis: --^ [10:18:18] ah no sorry wrong one sigh [10:18:27] nevermind, same error [10:18:38] ttl! [10:20:16] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) I created a backup of the production datahub database with the following command on db1108 ` btullis@db1108:~$ sudo mysqldump -S /var/run/mysqld/mysqld.analytics_meta.sock --single-transa... [10:42:13] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) I'm still going to have to configure kafka manually, since we have errors from the kafka-setup job. 
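[Editor's note] A `LimitNOFILE=8192` change like the one applied at [10:07:40] is normally made permanent with a systemd drop-in rather than by editing the unit file. The sketch below follows the standard drop-in convention; the path and unit name are assumptions, not taken from the actual Puppet change:

```ini
# /etc/systemd/system/burrow-main.service.d/override.conf  (illustrative path)
[Service]
# Raise the file-descriptor limit hit by "accept4: too many open files"
LimitNOFILE=8192
```

After writing the drop-in, `systemctl daemon-reload` and a restart of the unit are needed for it to take effect.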
The list of topics is: ` MetadataAuditEvent_v4 MetadataChangeEvent_v4 FailedMetadataCha... [10:53:21] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) I have created the missing topic with: ` btullis@kafka-jumbo1001:~$ kafka topics --create --if-not-exists --partitions 1 --replication-factor 3 --config retention.ms=-1 --topic DataHubUpg... [10:58:55] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) Also, according to [[https://github.com/datahub-project/datahub/blob/v0.10.4/docker/kafka-setup/kafka-setup.sh#L117|the kafka-setup.sh script]] there is no reason why `PlatformEvent_v1` s... [11:00:09] !log Proceeding to upgrade datahub in production [11:00:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:56:14] 10Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (10diego) Hi @MoritzMuehlenhoff , Yes, please I'll need a copy of all the data both on the stat machines and HDFS Thanks! [12:00:50] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) The upgrade has gone well, I think. The only thing is that it looks like the sample data I ingested into the staging instance yesterday ended up in production too. {F37135853,width=60%} I... 
[12:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:42] still trying to fix the burrow jumbo exporter [13:40:15] 10Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (10MoritzMuehlenhoff) >>! In T340427#9004966, @diego wrote: > Hi @MoritzMuehlenhoff , > Yes, please I'll need a copy of all the data both on the stat machines and HDFS > > Thanks! The data is processed by Data... 
[13:40:33] 10Data-Platform-SRE: null shown in the user profile dropdown in datahub - https://phabricator.wikimedia.org/T327969 (10BTullis) This has now been resolved as part of {T329514} {F37135939,width=70%} [13:40:43] 10Data-Platform-SRE: null shown in the user profile dropdown in datahub - https://phabricator.wikimedia.org/T327969 (10BTullis) 05Open→03Resolved [13:51:05] 10Data-Engineering, 10Data-Platform-SRE, 10observability, 10Patch-For-Review: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10elukey) [13:56:08] 10Quarry, 10Cloud-VPS, 10Toolforge, 10WMF-Legal: Potential ambiguities in the Labs Terms of Use - https://phabricator.wikimedia.org/T140486 (10fnegri) [14:10:00] (03CR) 10D3r1ck01: "@note: node10 or node12 is not supported in CI. So this patch will always fail and would need manual testing." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/937073 (https://phabricator.wikimedia.org/T341038) (owner: 10D3r1ck01) [14:20:54] 10Data-Engineering: Check home/HDFS leftovers of paramd - https://phabricator.wikimedia.org/T340427 (10diego) I have access to most of the data, I can wait a couple of weeks to get the full dump. [14:29:00] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) This looks better. It's deleting the existing indices and then issuing an MAE for each aspect. ` 2023-07-11 14:27:18,884 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Starting upgr... [14:33:54] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) Success! The cleanup job has successfully removed all of the errant data from elasticsearch and rebuilt the indices. 
{F37136000,width=70%} [14:36:59] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:39:27] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:48:52] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:49:38] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:51:55] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [14:51:58] (03PS1) 10Btullis: Update the datahub packaged environment to v0.10.4 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937137 (https://phabricator.wikimedia.org/T329514) [14:58:17] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) @Milimetric - I could do with your help to [[https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub/Upgrading#Deploy_datahub_CLI_tool|update the conda environment]] for the... [15:09:44] (EventLoggingKafkaLag) firing: Kafka consumer lag for event logging over threshold for past 15 min. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_processor_client_side_00 - https://alerts.wikimedia.org/?q=alertname%3DEventLoggingKafkaLag [15:24:58] btullis: did you do anything? 
[15:25:32] weird metrics re-appeared, but burrow is still erroring [15:29:10] ok I may have found a solution [15:29:18] https://github.com/linkedin/Burrow/wiki/Consumer-Kafka [15:29:31] I've set start-latest=true [15:29:44] (EventLoggingKafkaLag) resolved: Kafka consumer lag for event logging over threshold for past 15 min. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_processor_client_side_00 - https://alerts.wikimedia.org/?q=alertname%3DEventLoggingKafkaLag [15:29:45] Oh great. [15:30:09] so it bypassed the big thing that caused the issue, I think [15:31:16] let's see if re-running it with old settings works as well [15:31:55] nope, it fails [15:34:42] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on dse-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:36:38] btullis: created https://gerrit.wikimedia.org/r/c/operations/puppet/+/937144, I think it is the best compromise [15:36:41] lemme know your thoughts [15:39:42] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on dse-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:07] elukey: Do you think that with this it would be skipping events? Is there a data loss issue, or am I misunderstanding? 
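[Editor's note] On the "would it be skipping events?" question above: Burrow derives consumer lag from the group's committed offsets versus the broker's log-end offsets, and `start-latest=true` only changes where Burrow begins reading the `__consumer_offsets` commit log; the consumers themselves are untouched. A toy sketch of that lag calculation (function name and numbers are mine, purely illustrative):

```python
# Toy sketch of the per-partition lag a tool like Burrow evaluates:
#   lag = broker log-end offset - consumer group's last committed offset
# start-latest=true only moves where Burrow starts reading the commit log;
# the consumers' own positions are unaffected. All values are illustrative.

def consumer_lag(log_end_offsets, committed_offsets):
    """Return {partition: lag}, clamped at zero."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }

if __name__ == "__main__":
    lags = consumer_lag({0: 1200, 1: 950}, {0: 1150, 1: 950})
    print(lags)  # {0: 50, 1: 0}
```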
[15:45:07] 10Data-Engineering, 10Data-Platform-SRE, 10observability, 10Patch-For-Review: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10elukey) 05Open→03Resolved [15:45:22] I'm still struggling to get my head around it. [15:45:40] 10Data-Engineering, 10Data-Platform-SRE, 10observability, 10Patch-For-Review: The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold - https://phabricator.wikimedia.org/T341551 (10elukey) To keep archives happy - I think that burrow tried to pull a huge... [15:46:06] btullis: burrow reads the commit logs for consumer groups, so yeah starting from latest would skip some data [15:46:27] but in theory we don't really care, as long as we have a good picture of what's happening after it gets restarted [15:46:42] not entirely sure why the eventlogging's commit log went that big [15:46:47] OK, got it. Thanks so much for all of your help. [15:46:57] <3 [15:47:57] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) [15:50:32] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10MoritzMuehlenhoff) >>! In T336286#8990904, @BTullis wrote: >>>! In T336286#8990871, @xcollazo wrote: >>>>! In T336286#8990318, @BTullis wrote: >>> I made a patch to try the upgrade to version... [15:51:15] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) I'm running the `publish-debian-package` pipeline manually on https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/pipelines/21904 When it's complete I will upload th... [15:53:29] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) >>! 
In T336286#9005977, @MoritzMuehlenhoff wrote: > > There was a 2.6.3 release, which fixes five additional issues (all marked "low" by upstream, though): Oh, these Airflow releas... [15:58:16] milimetric: If you have any time to help out with this, I'd be grateful: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/937137/ I've not built a packaged environment like this before. [15:59:12] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [15:59:19] btullis: oh yeah, sure, you mean building and publishing it to archiva? [15:59:34] you got it, I'll do it and then merge your change when it's done [15:59:40] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [16:01:11] milimetric: Thanks so much. I could probably muddle through, but I was confused over my own local conda environment vs pip etc. As you wrote the README, I thought I'd ask for help. [16:05:01] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) Note that the edit to maintain the floating point number decimal places is achieved by passing the `floatfmt` kwarg to [[ https://gi... 
[16:05:19] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [16:07:59] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) [16:16:49] btullis: https://archiva.wikimedia.org/#artifact/datahub/cli/0.10.4 [16:16:54] (merging code now) [16:17:11] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "packaged and deployed to https://archiva.wikimedia.org/#artifact/datahub/cli/0.10.4" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937137 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [16:30:47] milimetric: <3 Many thanks. [16:55:00] (03CR) 10DLynch: [C: 03+1] Fix editattemptstep ref [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: 10Kimberly Sarabia) [17:23:23] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.3 - https://phabricator.wikimedia.org/T336286 (10MoritzMuehlenhoff) [19:59:25] 10Data-Platform-SRE: Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (10BTullis) This is still not resolved by the recent upgrade to DataHub 0.10.4, but we can now press ahead with the plan to switch DataHub authentication to OIDC in T305874 [20:01:11] 10Data-Engineering, 10Epic: Data Catalog MVP - https://phabricator.wikimedia.org/T299910 (10BTullis) [20:01:13] 10Data-Platform-SRE, 10Data-Catalog: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (10BTullis) 05Open→03Resolved [20:02:50] 10Data-Platform-SRE, 10Data-Catalog: Review and improve the build process for DataHub containers - https://phabricator.wikimedia.org/T303381 (10BTullis) I've created this ticket {T341194} for migrating the DataHub build pipeline to GitLab. 
[20:04:20] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) p:05Medium→03High Raising the priority of this task. [20:12:37] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis) a:05Stevemunene→03BTullis [20:14:37] 10Data-Engineering, 10Data-Catalog: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (10BTullis) [20:14:39] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data Pipelines: Integrate Airflow with DataHub - https://phabricator.wikimedia.org/T306977 (10BTullis) [20:27:59] 10Data-Engineering, 10Data-Catalog: Ingest feature Hive schema into datahub - https://phabricator.wikimedia.org/T326598 (10BTullis) @odimitrijevic - Can this ticket be resolved now? We have metadata about the hive tables: [[https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf.web... [20:32:56] 10Data-Engineering, 10Data-Catalog: Re-enable Public Druid metadata ingestion - https://phabricator.wikimedia.org/T311547 (10BTullis) It seems like this might be a good use-case for the DataHub Actions Framework: https://datahubproject.io/docs/act-on-metadata/ Currently it seems that we run the druid inde... [20:36:00] 10Data-Engineering, 10Data-Catalog: Establish a Business Glossary - https://phabricator.wikimedia.org/T311524 (10BTullis) 05Open→03Resolved a:03BTullis I think it's fair to say that this is done. https://datahub.wikimedia.org/glossary {F37136335,width=70%} [21:04:04] 10Analytics-Radar, 10Data-Engineering: stat1005: failing systemd job - https://phabricator.wikimedia.org/T330671 (10BTullis) 05Open→03Resolved We can close this ticket. 
Failed user jupyterhub servers caused some noise, but we have since mitigated that with : {T336951} [21:05:20] 10Analytics, 10Data-Engineering-Icebox: jmx_presto prometheus job down for some an-presto hosts - https://phabricator.wikimedia.org/T327753 (10BTullis) 05Open→03Resolved a:03BTullis This is fixed. [21:07:47] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Requesting permission to use kafka-main to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10bking) [21:09:06] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10bking) [21:12:58] 10Analytics, 10Data-Engineering-Icebox, 10Dumps-Generation, 10cloud-services-team: analytics-dumps-fetch-unique_devices.service failing on dumps servers - https://phabricator.wikimedia.org/T318849 (10BTullis) 05Open→03Resolved a:03BTullis I believe that this can be resolved. [21:13:48] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Mail: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (10BTullis) [21:21:19] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Discussion of Event Driven Systems - https://phabricator.wikimedia.org/T290203 (10BTullis) Should we move these links and thoughts to wikitech, or should we leave this ticket open? [21:21:48] 10Data-Engineering-Icebox, 10Product-Analytics, 10Editing-team (Tracking): How often do people try to edit on mobile devices, using the desktop site, at the English Wikipedia? - https://phabricator.wikimedia.org/T288972 (10BTullis) [21:24:45] 10Data-Engineering-Icebox, 10Pageviews-Anomaly, 10Product-Analytics: Analyse possible bot traffic for ptwiki article Ambev - https://phabricator.wikimedia.org/T282502 (10BTullis) Should we close this ticket? We still have the wider topic ticket open? 
{T138207} [21:29:01] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Discussion of Event Driven Systems - https://phabricator.wikimedia.org/T290203 (10odimitrijevic) 05Open→03Resolved a:03odimitrijevic [21:29:39] 10Data-Engineering, 10Data-Catalog: Ingest feature Hive schema into datahub - https://phabricator.wikimedia.org/T326598 (10odimitrijevic) 05Open→03Resolved a:03odimitrijevic [21:30:31] 10Data-Engineering-Icebox, 10Data-Platform-SRE: Upgrade to Kafka MirrorMaker 2 - https://phabricator.wikimedia.org/T277467 (10BTullis) [21:40:19] 10Analytics-Radar, 10Data-Engineering-Icebox, 10Unplanned-Sprint-Work: Filter out bot traffic for all of our metrics - https://phabricator.wikimedia.org/T276308 (10BTullis) Can this ticket be closed now, or is there still more work to be done? [21:45:19] 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, 10Shared-Data-Infrastructure (2022-23 Q4 Wrap up): Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10odimitrijevic) @BTullis do the permissions need to be removed before cl... [22:01:10] 10Data-Engineering, 10AQS2.0, 10PageViewInfo, 10API Platform (AQS 2.0 Roadmap): MediaWiki frequently receives HTTP 500 from AQS (via PageViewInfo extension) - https://phabricator.wikimedia.org/T341634 (10Krinkle)