[00:08:27] <icinga-wm>	 RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:46:36] <wikibugs>	 10Analytics, 10LDAP-Access-Requests, 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Patch-For-Review: Superset/Turnilo access for User:STei - https://phabricator.wikimedia.org/T282947 (10elukey) 05Open→03Resolved The LDAP user `stei` has a @wikimedia.org email and I see the manager approval f...
[05:46:41] <wikibugs>	 10Analytics, 10LDAP-Access-Requests, 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo (deadline EOD Monday 17) - https://phabricator.wikimedia.org/T282589 (10elukey)
[06:45:21] <joal>	 Good morning!
[06:45:25] * joal is b
[06:45:40] * joal is back at home - But still has typing problems :)
[06:46:25] <elukey>	 bonjour :)
[06:49:06] <joal>	 Bonjour Luca - What's up today?
[06:52:34] <elukey>	 nothing big, I am still reading things about k8s and istio
[06:52:36] <elukey>	 and you??
[07:57:44] <joal>	 Backfilling emails, helping Aisha for WDQS queries analysis, pinging hnowlan about cassandra3 loading (Hi hnowlan :), and possibly some Gobblin once all that is done :)
[07:59:02] <elukey>	 joal: did you see https://scala-lang.org/blog/2021/05/14/scala3-is-here.html?
[07:59:11] <joal>	 I have seen that elukey :)
[08:17:48] <wikibugs>	 (03PS4) 10Ladsgroup: Add script to send ratio to max auto_increment value of tables [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/691242 (https://phabricator.wikimedia.org/T274419)
[08:51:28] <wikibugs>	 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), 10Patch-For-Review, 10WMDE-TechWish (Sprint-2021-02-03): Adjust edit count bucketing for CodeMirror - https://phabricator.wikimedia.org/T273471 (10Lena_WMDE)
[09:07:07] <wikibugs>	 (03PS1) 10Ladsgroup: Introduce edit_count_by_namespace metric [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356)
[09:34:49] <hnowlan>	 hey joal :) welcome back. I tried truncating some tables last week but it didn't work properly - I've completed them just now so the nodes should have little data in most tables. 
[09:34:58] <hnowlan>	 joal: is there any pattern to the tables themselves that fail? 
[09:35:15] <joal>	 Hi hnowlan - There is a pattern in tables yes
[09:38:57] <joal>	 hnowlan: unique-devices.data, top_percountry.data mostly
[09:39:31] <joal>	 We have encountered a pageviews_per_article_flat issue, but I don't know if it is related or not, will try to relaunch
[09:40:00] <joal>	 jobs for unique-devices.data fail almost conistantly
[09:43:57] <joal>	 hnowlan: should we start fresh with new empty tables?
[09:45:07] <joal>	 !log Rerun of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-5-15
[09:45:10] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:58:20] <hnowlan>	 joal: yep hopefully! not all tables are clear at the moment, just the largest ones but if this run fails I will truncate every single one 
[09:59:53] <hnowlan>	 I have successfully completeted a repair also now that tables are smaller 
[10:03:23] <joal>	 ack hnowlan 
[10:03:52] <joal>	 hnowlan: Trying to run a unique-devices load now (the failure of the larger job was due to C-2 cluster, so unrelated)
[10:14:48] <joal>	 hnowlan: just to be sure before starting: the repair was for unique_devices.data, right?
[10:18:23] <hnowlan>	 joal: it was for all tables but there was a repair undertaken on unique_devices.data
[10:18:42] <joal>	 all of that is done now, right?
[10:18:45] <joal>	 hnowlan: --^
[10:24:18] <hnowlan>	 joal: yep
[10:28:43] <joal>	 !log Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing
[10:28:45] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:35:41] <joal>	 hnowlan: test failed for unique-devices
[10:35:55] <joal>	 hnowlan: can we try to drop the keyspace and recreate it?
[10:37:30] <hnowlan>	 joal: yep, will do 
[10:37:38] <joal>	 thanks a lot hnowlan 
[10:41:22] <hnowlan>	 joal: done
[10:41:33] <joal>	 testing anew
[10:41:58] <joal>	 !log Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after drop/create of keyspace
[10:42:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:52:55] <wikibugs>	 10Analytics, 10Analytics-Kanban: Superset Presto LIMIT >10000 error - https://phabricator.wikimedia.org/T282632 (10JAllemandou) Hi @SNowick_WMF, I double checked the number of expected rows and got `11161`, not `80633` as you mentioned.  I ran:  ` WITH parsed_json AS (   SELECT   event.app_install_id AS instal...
[10:54:12] <hnowlan>	 looks like there's some data in that table now at least...
[10:54:40] <joal>	 hnowlan: job failed :)
[10:55:04] <joal>	 so, some data there is, but the job failed- this is unexplainable to me :(
[10:55:56] <joal>	 hnowlan: Dan has a loading job in spark that takes advantage of the spark-cassandra connector which seems not to have the problem
[10:56:28] <joal>	 hnowlan: Given we wanted in the end move to that, I think we're gonna priotize it to now, to unlock the migration
[10:57:45] <hnowlan>	 joal: aww :/ same host/error on the failure? 
[10:57:54] <hnowlan>	 do we know how many records the job *should* have written? 
[10:58:08] <joal>	 hnowlan: I can find the number of records yes
[10:58:14] <joal>	 hnowlan: checking forh ost
[11:00:21] <hnowlan>	 the table has 790 rows in the new cluster 
[11:01:10] <joal>	 hnowlan: host with ClockFactory problem: [client-[/10.64.32.128, /10.64.16.204, /10.64.32.147]]
[11:02:39] <joal>	 hnowlan: 856 rows expected
[11:02:44] <joal>	 we're missing some
[11:03:09] <hnowlan>	 damn, baffling 
[11:03:41] <joal>	 hnowlan: interestingly: 856 - (856/12) ~= 790 :)
[11:04:09] <joal>	 the numbers make sense at least, even if the failure doesn't
[11:04:59] <joal>	 hnowlan: how about a remove of the host? (or instance)?
[11:05:10] <joal>	 for testing purposes obviously
[11:06:22] <hnowlan>	 joal: sure, makes sense - I can do that now if you'd like 
[11:06:35] <joal>	 works for me hnowlan - when it bests suits you
[11:08:44] <hnowlan>	 joal: down now 
[11:08:57] <joal>	 ack hnowlan - trying to relaod
[11:09:23] <joal>	 !log Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after host generating failures has been moved out of cluster
[11:09:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:30:14] <joal>	 still failed hnowlan :(
[11:30:23] <hnowlan>	 different hosts listed at least? 
[11:30:33] <joal>	 hnowlan: java.net.UnknownHostException: aqs1013-b.eqiad.wmnet
[11:30:45] <hnowlan>	 O_O 
[11:30:56] <hnowlan>	 whaaat
[11:31:05] <joal>	 and: com.datastax.driver.core.exceptions.TransportException: [/10.64.32.128:9042] Cannot connect
[11:31:11] <joal>	 this one makes more sense --^
[11:31:31] <hnowlan>	 yeah, not being able to connect to the downed host makes sense 
[11:31:40] <hnowlan>	 but resolution errors are a bit more alarming 
[11:32:31] <hnowlan>	 restarting the aqs1012 cassandras for the short-term 
[11:32:43] <hnowlan>	 checking logs on the cassandras to see if there's something weird 
[11:32:53] <joal>	 hnowlan: You know what - After having failed to connect to the host, it then fails with the same as before errors on timestamps :(
[11:33:08] * joal is in complete fog
[11:41:21] <hnowlan>	 it's so weird that that the spark-cassandra connector works fine too
[11:41:43] <hnowlan>	 I wonder if there's some failure (in)tolerance default in the cassandra 3 client we're not seeing 
[11:42:36] <joal>	 hm - we're loading with quorum=LOCAL_QUORUM, and I don't think that's been specified for the spark-cassandra loader
[11:42:40] <joal>	 hnowlan: --^
[11:43:14] <hnowlan>	 ohhh interesting, let me check
[11:43:18] <hnowlan>	 brb
[11:50:42] <wikibugs>	 10Analytics: Superset query timeouts for charts using Druid table - https://phabricator.wikimedia.org/T282618 (10JAllemandou) TL;DR: This problem comes from how queries are translated from SQL to druid-query-plan. I don't have a solution for this :(  The query that times-out is (available using 'view query' in t...
[12:07:43] <wikibugs>	 (03PS1) 10Joal: Add public_cloud info to webrequest in druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/692310 (https://phabricator.wikimedia.org/T279380)
[12:09:01] <wikibugs>	 10Analytics, 10SRE, 10Traffic, 10Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou) @CDanis the patch for Druid is there - sorry for not having acted quicker.
[12:09:06] <hnowlan>	 LOCAL_QUORUM should be correct, wonder if the spark-cassandra job will fail with that set 
[12:09:16] <hnowlan>	 and/or I wonder what its default is 
[12:09:51] <joal>	 hnowlan: IIRC the default is not local, that's why we specified it, but I'm not sure
[12:10:33] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10SRE, 10Traffic, 10Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou) a:03JAllemandou
[12:14:11] <hnowlan>	 yeah the default is ONE 
[12:17:42] * joal grumbles about what else could make the loading fail
[12:26:26] <hnowlan>	 Might be time to call eric, heh
[12:26:28] * hnowlan lunch
[12:31:23] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Wikimedia-Developer-Portal, 10Documentation: Clean up EventLogging Schema: pages on meta - https://phabricator.wikimedia.org/T282584 (10Ottomata) @Krinkle @ori
[12:35:47] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Wikimedia-Developer-Portal, 10Documentation: Clean up EventLogging Schema: pages on meta - https://phabricator.wikimedia.org/T282584 (10Ottomata) Context in comments:  https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/650174...
[12:36:14] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Wikimedia-Developer-Portal, 10Documentation: Clean up EventLogging Schema: pages on meta - https://phabricator.wikimedia.org/T282584 (10Ottomata) We do edit-protect the migrated schemas though, so if there is a way to show that in the schema na...
[12:45:06] <ottomata>	 joal: milimetric razzi FYI in case you haven't seen( (assuming you have): https://docs.google.com/document/d/1ptrSXusPeS-4rO1eu0PSFmxcoP7R8VUV_5fD7KfEEVo/edit#
[12:55:12] <milimetric>	 ottomata: I hadn't seen that yet!  gmodena had talked to us about the AirFlow part of that pipeline, but not the serving layer yet.  Super interesting, fits in with AQS 2.0, but we should definitely get chris a. involved, I'm assuming he has ideas and opinions
[13:09:07] <wikibugs>	 (03PS2) 10Martaannaj: Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152
[13:09:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152 (owner: 10Martaannaj)
[13:36:04] <wikibugs>	 10Analytics, 10DBA, 10Event-Platform, 10WMF-Architecture-Team, 10Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (10akosiaris) >>! In T120242#7088486, @Ottomata wrote: >> Do we have estimations (or even better ha...
[13:48:27] <mforns>	 helloooo
[14:09:08] <wikibugs>	 (03PS3) 10Martaannaj: Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152
[14:09:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152 (owner: 10Martaannaj)
[14:10:09] <elukey>	 hola mforns 
[14:14:04] <wikibugs>	 (03PS4) 10Martaannaj: Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152
[14:42:09] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Wikimedia-Developer-Portal, 10Documentation: Clean up EventLogging Schema: pages on meta - https://phabricator.wikimedia.org/T282584 (10Krinkle) When a schema ceased to be used (removed, merged with another, moved to other system, moved to Even...
[14:44:58] <hnowlan>	 joal: do you have full pastes of the cassandra write errors anywhere? in phab maybe?
[14:59:00] <joal>	 hnowlan: th
[14:59:04] <joal>	 hnowlan: sorry - again
[14:59:10] <joal>	 hnowlan: the logs are quite big
[14:59:36] <joal>	 hnowlan: I'll provide you a way to access the logs rather than copying
[15:02:00] <hnowlan>	 joal: that'd be great, thanks! 
[15:05:12] <ottomata>	 a-team standuP!
[15:10:18] <wikibugs>	 10Analytics-Kanban, 10Patch-For-Review: Update refinery-cassandra dependencies to have support for Cassandra 3 - https://phabricator.wikimedia.org/T280649 (10JAllemandou) @hnowlan : Here is a way to access failure logs from todays job (when host was down): from `an-launcher1002`:  ` sudo -u analytics kerberos-...
[15:10:34] <joal>	 hnowlan: https://phabricator.wikimedia.org/T280649#7093055
[15:11:50] <wikibugs>	 10Analytics, 10Event-Platform: Deploy schema repos to analytics cluster and use local uris for analytics jobs - https://phabricator.wikimedia.org/T280017 (10Ottomata)
[15:12:00] <hnowlan>	 joal: thanks! 
[15:19:19] <wikibugs>	 (03CR) 10Milimetric: [V: 03+2 C: 03+2] "There's a bit of duplication here because Dashiki builds and pushes Lato locally into the dashboards directory.  But we don't have a good " [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/688356 (https://phabricator.wikimedia.org/T182804) (owner: 10Razzi)
[15:20:56] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Wikimedia-Developer-Portal, 10Documentation: Clean up EventLogging Schema: pages on meta - https://phabricator.wikimedia.org/T282584 (10Ottomata) I don't love deleting, because as you say it makes it hard to find the past schema versions and ol...
[15:37:02] <wikibugs>	 10Analytics-Clusters: Verify if Superset can authenticate to Druid via TLS/Kerberos - https://phabricator.wikimedia.org/T250487 (10Ottomata) a:05elukey→03None
[15:37:07] <wikibugs>	 10Analytics-Clusters: Verify if Turnilo can pull data from Druid using Kerberos/TLS - https://phabricator.wikimedia.org/T250485 (10Ottomata) a:05elukey→03None
[15:39:09] <wikibugs>	 10Analytics-Radar, 10Wikidata, 10Wikidata-Query-Service: PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (10dcausse) a:05dcausse→03None Made https://github.com/nomoa/flink-python-demo but stopped actively working on this for the moment, have hit issues with python env.
[15:40:31] <wikibugs>	 10Analytics, 10Discovery, 10Event-Platform, 10Platform Engineering, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10odimitrijevic) p:05Triage→03High
[15:42:46] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Patch-For-Review: LandingPageImpression Event Platform Migration - https://phabricator.wikimedia.org/T282855 (10odimitrijevic) p:05Triage→03High
[15:43:13] <wikibugs>	 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (10Ottomata)
[15:45:24] <wikibugs>	 10Analytics-Clusters, 10SRE: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (10Ottomata) a:03Ottomata
[15:45:33] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10SRE: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (10Ottomata)
[15:47:38] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Remove all debian python-* packages installed for analytics clients, use conda instead - https://phabricator.wikimedia.org/T275786 (10Ottomata)
[15:48:20] <hnowlan>	 joal: could you get me the application ID of the earlier job that failed when the host was up please? Or let me know how to find that out :) 
[15:48:44] <joal>	 hnowlan:  application_1620304990193_40608 is the one :)
[15:49:11] <hnowlan>	 thanks! 
[15:50:23] <wikibugs>	 10Analytics: Superset query timeouts for charts using Druid table - https://phabricator.wikimedia.org/T282618 (10odimitrijevic) p:05Triage→03Medium
[15:51:23] <wikibugs>	 10Analytics: Superset query timeouts for charts using Druid table - https://phabricator.wikimedia.org/T282618 (10odimitrijevic) @JAllemandou to file an upstream bug
[15:52:50] <wikibugs>	 10Analytics: Superset query timeouts for charts using Druid table - https://phabricator.wikimedia.org/T282618 (10JAllemandou) Also: One way to get results is to set the `time-grain` to the value: `original value`. This makes calcite use the topN query (single field in group-by instead of two). You'll get daily v...
[16:06:15] <wikibugs>	 10Analytics, 10Discovery, 10Event-Platform, 10Platform Engineering, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10Ottomata) I'd say this is medium to low priority and is something that needs to be worked on in collaboration with maintaine...
[16:16:17] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Enable shouldGenerateExample [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/691236 (https://phabricator.wikimedia.org/T270134) (owner: 10Ottomata)
[16:16:57] <wikibugs>	 (03Merged) 10jenkins-bot: Enable shouldGenerateExample [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/691236 (https://phabricator.wikimedia.org/T270134) (owner: 10Ottomata)
[16:18:26] <wikibugs>	 10Analytics: Superset query timeouts for charts using Druid table - https://phabricator.wikimedia.org/T282618 (10JAllemandou) And the feature request: https://github.com/apache/druid/issues/11264
[16:25:52] <hnowlan>	 joal: so maybe I'm mis-reading the errors but it seems like the consistent error is "Too many open files" rather than connection issues. Could it be that the new version of the Cassandra library pushes us over some limit on the workers themselves? 
[16:26:11] <hnowlan>	 joal: especially the failure in the more recent of epoll_create1() 
[16:26:41] <hnowlan>	 would increasing a ulimit on the anworkers be worth exploring? 
[16:26:53] <joal>	 hnowlan: while I hear your point, how come almost all jobs succeed and 2 fail?
[16:27:44] <joal>	 hnowlan: IIRC ulimit on workers is unlimited
[16:29:26] <hnowlan>	 joal: yeah I don't have a good answer for that yet :/ 
[16:29:40] <hnowlan>	 honestly at the moment this is the only concrete error we have afaics 
[16:30:28] <hnowlan>	 there could be an inotify limit rather than just a file limit
[16:31:26] <joal>	 hnowlan: we have some values here - Max open files            32768                32768                files 
[16:31:29] <joal>	 oops sorry
[16:31:34] <joal>	 https://phabricator.wikimedia.org/T281792
[16:31:36] <joal>	 hnowlan: --^
[16:31:41] <joal>	 so not unlimited :)
[16:31:58] <joal>	 but still very high :(
[16:33:30] <joal>	 hnowlan: the problem we see in the logs of failed jobs versus successfull jobs is the wall of: "INFO [client-[/10.64.32.128, /10.64.16.204, /10.64.32.147]] com.datastax.driver.core.ClockFactory: Using native clock to generate timestamps."
[16:34:14] <joal>	 that always leads to Too many open files
[16:35:13] <joal>	 In successfull jobs there are some of those lines (similiar to the failing job when cluster connections are being built), but then the job doesn't go in loop creating clocks!
[16:36:32] <hnowlan>	 ahhhh I see 
[16:37:01] <joal>	 With that, I actually think that the problem is from threads more than files for real
[16:37:12] <hnowlan>	 so if there is a limit, increasing the limits might actually be making the problem worse to begin with
[16:37:29] <joal>	 Could very well be :S
[16:42:10] <wikibugs>	 (03CR) 10Michael Große: [C: 04-1] "Looks good by itself, but giving it a -1 for now as it is not clear to me yet how this is different from what is done in `recent_changes_b" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup)
[16:44:34] <hnowlan>	 part of me keeps thinking there's something wrong with the firewall rules or something but I've quadruple checked the homer rules and it doesn't make sense that there would be a partial success 
[16:45:26] * joal tries to think of what we've not thought about hnowlan :(
[16:49:29] <elukey>	 joal, hnowlan and if we bump the max open files on the failing host as test? 
[16:49:38] <elukey>	 (if we haven't tried)
[16:49:44] <elukey>	 just to understand it that is the issue
[16:49:59] <elukey>	 the cluster is still wip so it is fine to manually change it
[16:50:12] <joal>	 elukey: the change needs to happen on the cluster (failure happens on the lcient)
[16:50:23] <joal>	 elukey: cluster = hadoop
[16:50:25] <joal>	 sorry
[16:50:33] <elukey>	 sigh
[16:51:31] <hnowlan>	 the idea of all of these threads being created in a loop to me seems like there's a silent error being hidden 
[16:51:46] <joal>	 I have a similar feeling hnowlan 
[16:52:25] <joal>	 hnowlan: I'm gonna run a failing job with DEBUG logging, for us to review in more detail if you wish
[16:52:37] <hnowlan>	 joal: wonderful, was just about to ask if we could do that 
[16:53:08] <joal>	 hnowlan: plenty meetings this evening, might be able to do it at the same time, but not sure :) otherwise tomorrow morning :)
[16:54:07] <hnowlan>	 the petrol tank is running dry here too so no rush, I'll look at it whenever you're ready 
[16:55:57] <elukey>	 also I would be curious about what limit on open files we are getting
[16:56:09] <elukey>	 I mean, what thread hits it
[17:00:58] <hnowlan>	 yeah, good question 
[17:04:07] <hnowlan>	 afaict from reading the driver source, that message about "Using native clock" that appears is printed on *every* instantiation of a Cluster object so it really does seem like there's some kind of loop 
[17:04:11] <hnowlan>	 which ties with the threads idea 
[17:04:45] <joal>	 yeah - It's as if the job had problem contacting the instance and tried anew
[17:06:31] <wikibugs>	 10Analytics, 10SRE, 10netops: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10ayounsi) @razzi from our IRC chat, the way I'd approach it is:  - for all the removed IPs, check if the host still exist, most of the cases it's just that the host is gone and the ACL never got updat...
[17:19:05] <hnowlan>	 another thing that is driving me to distraction is this: "[client-[aqs1010-a.eqiad.wmnet/10.64.0.88, /10.64.48.68, /10.64.16.206]]" - it comes up a lot, where there has been resolution of one host but not all and there's an empty string before the "/" (I am guessing) 
[17:20:54] <hnowlan>	 and also that "New Cassandra host aqs1012-b.eqiad.wmnet/10.64.32.145:9042 added" shows up but aqs1012-a never does 
[17:21:16] <hnowlan>	 I wonder if it would save us time to just completely reimage aqs1012 and see if anything changes *sigh* 
[17:31:13] <joal>	 I have no clue :S
[17:32:02] <elukey>	 hnowlan: I would +1 a quick reimage if it is not too much work for you, removing one variable from the table is surely good while we test
[17:32:34] <elukey>	 I mean we don't exactly know where the problem lies and we think that it could be a dirty host, if it is cheap lets wipe it
[17:33:02] <elukey>	 (i can help if you need a minion to do some work)
[17:58:44] <gmodena>	 milimetric the doc is mostly speculation at this point :). There's a companion proposal/braindump re adopting lambda architecture style of processing for some of our workloads https://docs.google.com/document/d/1-DLugMuUEFu8f3MyZVVEQJYv4SKBHN6rdgREPALoaKY/edit. I thought I included it in our airflow meeting notes, but I just realised I didn't. Sorry about that :(
[18:01:16] <milimetric>	 No worries gmodena: I think the speculation makes lots of sense so far, would love to work on it together with Chris
[18:19:52] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Radar, 10Metrics-Platform, 10Product-Data-Infrastructure, 10Vector (Vector (Tracking)): EventLogging revision popup gets hidden behind content in Vector - https://phabricator.wikimedia.org/T282550 (10Jdlrobson) I think it's still worth moving the element given the hi...
[18:56:05] <wikibugs>	 10Analytics: Import 2001 wikipedia data - https://phabricator.wikimedia.org/T155014 (10Graham87) I just found a particularly blatant example of a common username in the 2001 dump being later taken by a completely different user ... a good reason to be careful here! https://en.wikipedia.org/wiki/User:Aboyd_(2001_...
[19:07:45] <ottomata>	 mforns: vpv deploy?
[19:07:57] <mforns>	 ottomata: yes!
[19:08:20] <mforns>	 let me check the patch
[19:08:47] <ottomata>	 k
[19:08:53] <mforns>	 ottomata: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/689205
[19:08:58] <mforns>	 I think it's ready
[19:10:20] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Growth-Team, and 3 others: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (10Ottomata) https://gerrit.wikimedia.org/r/679353 has is out on all wikis in wmf.5!  After the next mediawiki history import we shoul...
[19:14:39] <mforns>	 ottomata: kafkacatting from cawiki (group0 I think)
[19:15:07] <joal>	 enough for today - see you tomorrow team
[19:15:38] <mforns>	 byeeeeeee joal 
[19:20:28] <ottomata>	 byyyeee <3 joal! :)
[19:20:30] <ottomata>	 mforns:  cool
[19:20:33] <ottomata>	 deployed
[19:22:33] <mforns>	 still seeing events com in
[19:22:36] <mforns>	 *come
[19:22:56] <mforns>	 IPs are there too
[19:26:53] <ottomata>	 mforns:  in new ep format e.g. with $schema and meat?
[19:26:54] <ottomata>	 meta*
[19:27:49] <mforns>	 ottomata: no meta yet
[19:28:07] <mforns>	 no $schema either
[19:28:21] <mforns>	 wait!
[19:29:15] <mforns>	 ottomata: yes, I see events with $schema, meta and client_ip!
[19:29:56] <mforns>	 no events come in without those now (grep -v meta)
[19:30:03] <ottomata>	 on cawiki?
[19:30:11] <mforns>	 oh, 1 came in in the old format now
[19:30:23] <ottomata>	 yeah might take a bit
[19:30:46] <mforns>	 cached JS?
[19:31:03] <mforns>	 still some of them in the old format
[19:32:13] <ottomata>	 ya
[19:32:23] <ottomata>	 cool looks great
[19:32:30] <ottomata>	 mforns:  let lett this back and then do all wikis tomorrow?
[19:32:31] <mforns>	 ottomata: in kafka by topic, the "Kafka bytes out by topic" metrics were previously 0 and now are seeing data
[19:32:34] <ottomata>	 prep a patch?
[19:32:40] <mforns>	 sure, was doing it
[19:32:44] <ottomata>	 :)
[19:32:44] <mforns>	 for tomorrow then?
[19:33:08] <mforns>	 is it normal that "Kafka bytes out by topic" metrics were previously 0?
[19:33:19] <ottomata>	 https://grafana.wikimedia.org/goto/kNVKLRqGk
[19:33:36] <ottomata>	 looking
[19:34:19] <mforns>	 how do you generate the short link?
[19:34:32] <ottomata>	 oh mforns  that is because you are consuming :p
[19:34:51] <mforns>	 oh, I got it https://grafana.wikimedia.org/goto/Y7fpYRqGk
[19:34:53] <ottomata>	 mfornsthe little graph looking sideways triangle near the title 
[19:34:54] <mforns>	 aaah! ofc
[19:35:16] <ottomata>	 hmm no thats not right
[19:35:21] <ottomata>	 camus hsould also cosume periodically
[19:35:33] <ottomata>	 i see 0 for days
[19:35:58] <mforns>	 hm
[19:36:36] <ottomata>	 oh
[19:36:38] <ottomata>	 mforns: its there
[19:36:43] <ottomata>	 https://grafana.wikimedia.org/goto/SUTJLgqMk
[19:36:53] <ottomata>	 its just more at the end
[19:36:59] <ottomata>	 since you are consuming
[19:37:13] <ottomata>	 or... is hard to see in zoomed out view?
[19:37:54] <mforns>	 I can see the data in the graph you passed, but not in the one in Kafka by topic
[19:38:01] <mforns>	 maybe the graph is not showing all
[19:38:24] <ottomata>	 if you change your time range to not inlude since you started consuming
[19:38:25] <ottomata>	 you can see it
[19:38:28] <ottomata>	 in spikes
[19:38:33] <ottomata>	 https://grafana.wikimedia.org/goto/lJlfYR3Gk
[19:39:29] <mforns>	 ottomata: yes yes, they are very small in comparison
[19:39:38] <mforns>	 thanks
[20:08:49] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Better Use Of Data, and 5 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (10Ottomata)
[21:20:03] <wikibugs>	 10Analytics, 10Discovery, 10Event-Platform, 10Platform Engineering, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10Milimetric) p:05High→03Medium
[21:21:47] <wikibugs>	 10Analytics, 10Analytics-Kanban: Superset Presto LIMIT >10000 error - https://phabricator.wikimedia.org/T282632 (10Milimetric) a:05Milimetric→03JAllemandou
[21:26:12] <wikibugs>	 10Analytics, 10Analytics-Kanban: Stop Refining mediawiki_job events in Hive - https://phabricator.wikimedia.org/T281605 (10Milimetric) The description says we're keeping the raw JSON import, just not the rest of the pipeline.  I agree to delete any of it, unused data is just confusing, just making sure everyon...
[21:36:53] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (10Milimetric) > Thanks, the [[ https://gerrit.wikimedia.org/r/admin/groups/ba706077efbf6c816f7f9bcd32d975e19aba7eb3 | wmde-qwerty ]] group would happily self-merge...
[21:42:17] <wikibugs>	 10Analytics, 10Analytics-Kanban: Crunch and delete many old dumps logs - https://phabricator.wikimedia.org/T280678 (10Milimetric) There's no need for a fancy tool, this would be a few lines of spark to read the data and save to, probably, a Hive table with an explicit schema.  Should take a day to set up and s...
[21:45:43] <wikibugs>	 10Analytics-Radar, 10Dumps-Generation: Temp files left around in wikistats_1/ ? - https://phabricator.wikimedia.org/T280311 (10Milimetric) > It takes up about 15G so honestly it's not that big a deal to keep around, even if there are only a few downloaders. I can't tell about our mirrors of course, but even fr...
[21:47:04] <wikibugs>	 10Analytics, 10Product-Analytics, 10wmfdata-python: wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (10Milimetric) Ok, weird, I can't reproduce this... maybe it's some weird access problem?  We'll triage and look into it
[21:59:39] <wikibugs>	 10Analytics, 10Analytics-Wikistats: "Page views by edition of Wikipedia" for each country - https://phabricator.wikimedia.org/T257071 (10Milimetric) Thanks very much for following through with that.  Seeing your prototype makes it very clear what you need and why.  I think ideally we would create a better pipe...
[22:28:00] <wikibugs>	 10Analytics, 10Analytics-Wikistats: "Page views by edition of Wikipedia" for each country - https://phabricator.wikimedia.org/T257071 (10A455bcd9) Thanks.  By the way, after I made this prototype and wrote [[ https://adssx.substack.com/p/the-rise-of-the-rest | this article ]] about its results, I was asked to...
[22:52:15] <wikibugs>	 10Analytics, 10Analytics-Kanban: Superset Presto LIMIT >10000 error - https://phabricator.wikimedia.org/T282632 (10SNowick_WMF) Hi thanks for looking into this - I ran the query again on Superset and got the 11161, but when I ran it from Jupyter I got the higher result but I omitted the  `event.edit_tasks`, th...
[22:52:31] <wikibugs>	 10Analytics, 10Analytics-Kanban: Superset Presto LIMIT >10000 error - https://phabricator.wikimedia.org/T282632 (10SNowick_WMF) 05Open→03Resolved
[22:56:37] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Wikimedia-Developer-Portal, 10Documentation: Clean up EventLogging Schema: pages on meta - https://phabricator.wikimedia.org/T282584 (10Krinkle) I would not recommend it as it's imho counter the usual discovery path for this type of information...