[03:21:58] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 3 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (Edtadros) a:Edtadros→ovasileva === Test Result - Prod **Stat...
[03:22:15] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 3 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (Edtadros)
[03:22:49] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 3 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (Edtadros)
[05:12:54] elukey: Happy Monday! Do you know if it's possible to read data from the production AQS Cassandra instance from the AQS testing machines? Like does it have some IP address I can access from the testing machines
[06:36:56] PROBLEM - Check the last execution of mediawiki-history-drop-snapshot on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki-history-drop-snapshot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:59:38] lexnasser: hello! Sadly no, the only way is to make an export of data on say aqs1004 and then copy it to the cloud hosts via scp (to your laptop first, then to the testing machines)
[06:59:47] there was some docs about how to do it
[07:00:04] https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Load_data_into_cassandra_in_beta
[07:00:12] I also need to update this doc with the new hostnames: D
[07:17:16] joal: bonjour! Thanks for the data drop :)
[07:28:43] Good morning elukey
[07:29:29] * joal was afraid of the alarms and the 2.5Pb data bar crossed
[07:30:49] we have 3.25PiB now so in theory crossing 2.5 is fine :D
[07:31:08] the downside of having the partitions filled is that Yarn NM may become unhealthy :(
[07:31:52] eah
[07:36:00] elukey: I wonder if we could tell HDFS not to use the full partitions, keeping some space for NM?
[07:37:23] joal: no idea, maybe there is a way
[07:38:04] for the mw history drop, was the error the same as the one that I have added to alerts@?
[07:38:24] IIRC we had the same issue once, and we had to force hive to check its partitions
[07:38:35] but I am wondering if it is due to your data drop
[07:39:58] elukey: indeed the error in alert is due to me manually dropping data - I'm sorry to have generated more alerts :(
[07:40:40] ahhh nono this is completely fine, you resolved 4/5 of them so your karma balance is positive :)
[07:44:36] elukey: also, I think I messed up yesterday as I tried to manually start 'refinery-drop-mediawiki-history-dumps.service', not 'mediawiki-history-drop-snapshot.service'
[07:46:52] anyhow - fixing hive state so that the script runs correctly
[07:47:28] !log Drop hive wmf.mediawiki_wikitext_history snapshot partitions (2020-08, 2020-09, 2020-10, 2020-11)
[07:47:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:48:40] !log Manually start mediawiki-history-drop-snapshot.service to check the run succeeds
[07:48:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:51:48] looks goood
[07:55:30] RECOVERY - Check the last execution of mediawiki-history-drop-snapshot on an-launcher1002 is OK: OK: Status of the systemd unit mediawiki-history-drop-snapshot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:56:01] elukey: \o/ - Dropping 1 partitions from wmf.mediawiki_wikitext_history
[07:56:32] elukey: Had I manually started the correct job yesterday, issues would have been fixed without alert ;)
[07:57:07] joal: you cannot pretend perfection on a sunday evening :)
[07:57:22] eh, the important is the target :)
[07:57:31] Maybe we could bump a little -setBalancerBandwidth in hdfs-balancer?
[07:57:59] IIUC it is set to 40MB
[08:06:12] given the size of our blocks, and the network pipes that we have now, we could maybe put 100/200 MB as test?
[08:06:16] elukey: +1
[08:06:42] puppet change incoming
[08:09:15] question for you elukey: the webrequest `dt` field is the timestamp in UTC (no timezone) at which a request starts being processed by varnish - right?
[08:11:18] joal: rechecked, it is the end of the req
[08:11:52] ack elukey - I was convinced the opposite - thanks for the recheck
[08:20:36] elukey: thanks for the balancer update - Shall we manually restart the instance when it's deployed (it runs once every-day)
[08:21:24] joal: I was about to ask - basically there is a run started on Feb 26th that it is still running :D
[08:21:36] wow :)
[08:21:38] so to get the new settings, we should probably stop that first
[08:22:01] elukey: let me check to see if we can force max-time
[08:22:11] good point
[08:23:22] elukey: looks like it doesn't exist
[08:23:31] elukey: let's kill and restart with the new settings
[08:25:54] !log stop/start hdfs-balancer on an-launcher1002 with bw 200MB
[08:25:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:32:15] joal: https://www.youtube.com/watch?v=Gs87KMFinYE&ab_channel=PrestoFoundation
[09:32:27] fb has 40k presto nodes ahahahah
[09:32:31] :D
[09:33:30] maybe alluxio will be good also for the future kubeflow training cluster
[09:33:42] elukey: also, I can see data movement being faster on HDFS
[09:34:43] joal: do you mean block moves?
[09:35:04] elukey: I mean datanodes usage averaging
[09:35:33] ahhh
[09:35:40] elukey: visual spread of nodes on UI is getting smaller
[09:35:42] what metric are you checking? Curious
[09:35:47] ack ack
[09:36:09] elukey: this metric is easy to read, as you can see, but difficult to repro :D
[09:36:17] :D
[09:36:51] I am running puppet on all the host for a big puppet clean up that I did on friday, no-op
[09:36:58] ack elukey
[09:36:58] very happy about it
[09:37:00] \o/
[09:37:17] happy elukey usually transfers to happier joal :)
[09:37:29] now I need to figure out a way to add the capacity scheduler config in a decent way in puppet, that was the original starting step :D
[09:37:56] right
[09:38:15] elukey: I have the task with the details open - will check later today
[09:38:18] the research team is already experimenting running tensorflow on the cluster via conda-pack etc.., with labels we might unblock a lot of interesting use cases (maybe)
[09:38:40] * joal hopes
[09:38:42] joal: yesyes when you have time, I think that we'll need to book 30 mins and discuss over meet pros/cons etc..
[09:38:53] sure elukey - when you wish
[09:39:15] when you have time, I don't want to keep postponing gobblin for you, ping me when you are "free-ish" :D
[09:41:43] elukey: I'm currently running an analysis for pageviews from email to analytics - no gobblining yet
[09:41:57] even worse then :D
[09:44:45] thanks for the email follow up in alerts elukey :)
[09:46:59] Analytics: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (elukey) Nice video to keep in mind https://www.youtube.com/watch?v=Gs87KMFinYE&ab_channel=PrestoFoundation Let's remember Presto scheduler's soft affinity when testing this, since if splits are assigned to random...
[09:53:20] :)
[09:53:25] I did a small change in https://grafana-rw.wikimedia.org/d/000000585/hadoop?viewPanel=25&orgId=1
[09:53:36] the metric is in GB, previously it was listed in GiB
[09:53:47] that always confuse me
[09:54:01] now it should be more correct
[09:54:14] super :)
[10:05:59] (CR) Tonina Zhelyazkova: "There is this note (but not an AC) in the task description that says: "We count only the edits an editor makes in a specific namespace. If" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/671195 (https://phabricator.wikimedia.org/T275999) (owner: Silvan Heintze)
[12:12:28] * elukey lunch!
[13:19:39] Analytics, Analytics-EventLogging, Better Use Of Data, Event-Platform, and 4 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (Ottomata) https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#Producer_types:_Guaranteed_and_Hasty We prefer `hasty=true`...
[13:34:16] Analytics, Event-Platform, Inuka-Team (Kanban): KaiOSAppFeedback Event Platform Migration - https://phabricator.wikimedia.org/T267345 (Ottomata) Hmm, what about if you set `Content-Type: application/json`?
[13:38:17] Analytics, Event-Platform, Research: TranslationRecommendation* Schemas Event Platform Migration - https://phabricator.wikimedia.org/T271163 (Ottomata) Ok, that's good! Note that a 202 with `hasty=true` does not necessarily mean everything is working. See https://wikitech.wikimedia.org/wiki/Event_P...
[13:43:34] Analytics, SRE: Upgrade to Kafka MirrorMaker 2 - https://phabricator.wikimedia.org/T277467 (Ottomata)
[13:43:42] Analytics, SRE: Upgrade to Kafka MirrorMaker 2 - https://phabricator.wikimedia.org/T277467 (Ottomata) p:Triage→Low
[13:45:31] Analytics, Event-Platform, Inuka-Team (Kanban): KaiOSAppFeedback Event Platform Migration - https://phabricator.wikimedia.org/T267345 (SBisson) >>! In T267345#6913687, @Ottomata wrote: > Hmm, what about if you set `Content-Type: application/json`? Can you do that with `sendBeacon`?
[13:46:07] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Marostegui) >>! In T120242#3963629, @Milimetric wrote: > We still have to check Debezium with the DBA...
[13:48:44] Analytics, Event-Platform, Inuka-Team (Kanban): KaiOSAppFeedback Event Platform Migration - https://phabricator.wikimedia.org/T267345 (Ottomata) No, but it should accept text/plain too, which sendBeacon does. It seems the lack of header at all is causing Express in EventGate to not parse the content.
[13:54:00] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > Where is debezium supposed to run? In k8s. > We should keep primary masters as clean a...
[13:55:41] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Joe) Reading the whole history here it seems that the problem we want to solve is a traditionally uns...
[13:58:38] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Joe) >>! In T120242#6913765, @Ottomata wrote: >> Where is debezium supposed to run? > In k8s. > >>...
[14:00:17] razzi: elukey FYI I moved our interview sync to tomorrow same time
[14:00:53] Analytics, Event-Platform, Inuka-Team (Kanban): KaiOSAppFeedback Event Platform Migration - https://phabricator.wikimedia.org/T267345 (SBisson) I'm using the same code to send client-side errors to 'kaios_app.error' and it works. Any significant differences between those streams?
[14:04:54] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > Wait, so the solution to eventgate being occasionally unreliable is to use a mysql table...
[14:07:03] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > We should focus on reconciliation strategies instead of chasing for panaceas for problems...
[14:10:20] Analytics, Event-Platform, Inuka-Team (Kanban): KaiOSAppFeedback Event Platform Migration - https://phabricator.wikimedia.org/T267345 (Ottomata) They do go to different endpoints, but no they should be the same. You on IRC/slack right now? Happy to help debug in realtime with ya.
[14:12:40] Analytics-Radar, Machine-Learning-Team, SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (fkaelin) I created a separate [[ https://docs.google.com/document/d/1Nffi3jUojC3BGNHkm2TyG7k5x30_7nzuPqgZ_tBeWNM/edit# | document ]] to discuss some of the bigger questions around orche...
[14:26:35] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Marostegui) >>! In T120242#6913765, @Ottomata wrote: >> Where is debezium supposed to run? > In k8s....
[14:28:57] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Joe) >>! In T120242#6913810, @Ottomata wrote: >> We should focus on reconciliation strategies instead...
[14:32:22] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > Because you will still have software failures, network partitions, error handling: this i...
[14:40:52] Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Marostegui) >>! In T120242#6913869, @Ottomata wrote: >> Because you will still have software failures...
[14:44:53] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Kormat)
[14:59:34] Hi team - I'm sorry I didn't notice: there have time-change in US, meaning standup is now, but I need to grab the kids from school :S
[14:59:38] I'll b
[14:59:41] again
[15:00:01] I'll be back in 1h, and will plan on organizing better for the rest of the week
[15:00:04] sorry for that
[15:08:44] joal: I got bitten in a previous meeting as well, don't worry :)
[15:20:19] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > Agree, I'd prefer to consume the binlog of a replica. >> Why not using this on c...
[15:26:05] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Kormat) I had a quick look, as mariadb & mysql's GTID implementations are different and inco...
[15:29:25] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) Relevant? https://debezium.io/documentation/reference/connectors/mysql.html#mysql...
[15:32:53] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > Their roadmap says that they won't look at what's required to support mariadb un...
[15:33:19] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Kormat) I don't think so; our mariadb clusters would count as a "Primary and replica" setup...
[15:36:09] elukey: Thanks for the info about loading data into the testing Cassandra instance! Although I'm able to ssh into the aqs-test machines, I'm unable to ssh into deployment-aqs01.deployment-prep.eqiad1.wikimedia.cloud, with error ```channel 0: open failed: administratively prohibited: open failed
[15:36:09] stdio forwarding failed
[15:36:09] kex_exchange_identification: Connection closed by remote host```. Would you happen to know how to fix this? I found this: https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#Connection_closed_by_remote_host , but none of the points seem to apply to me. Thanks!
[15:37:32] elukey: yooho
[15:37:35] Analytics, Machine-Learning-Team: Configure the Hadoop cluster to use the GPUs available on some workers - https://phabricator.wikimedia.org/T276791 (fkaelin) I don't think splitting the GPU machines from the yarn cluster is a far fetched idea, especially given the hurdles of making this work with yarn -...
[15:39:19] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Marostegui) >>! In T120242#6914026, @Ottomata wrote: >> Agree, I'd prefer to consume the bin...
[15:39:34] Analytics-Clusters: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906 (Ottomata) a:elukey
[15:40:50] Analytics-Clusters, Analytics-Kanban: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906 (Ottomata)
[15:43:10] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Review the Yarn Capacity scheduler and see if we can move to it - https://phabricator.wikimedia.org/T277062 (Ottomata) a:elukey
[15:43:25] Analytics-Clusters, Analytics-Kanban, Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (Ottomata) a:srodlund→elukey
[15:46:41] Analytics, Product-Analytics: Hive Runtime Error - Query on event.MobileWikiAppDailyStats failing with errors - https://phabricator.wikimedia.org/T277348 (fdans) a:mforns
[15:59:57] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 3 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (phuedx)
[16:08:20] Analytics-Clusters: Review recurrent Hadoop worker disk saturation events - https://phabricator.wikimedia.org/T265487 (Ottomata) a:elukey→None
[16:11:42] Analytics, Product-Analytics: Hive Runtime Error - Query on event.MobileWikiAppDailyStats failing with errors - https://phabricator.wikimedia.org/T277348 (SNowick_WMF) p:Triage→Medium
[16:13:48] Analytics, Inuka-Team, Product-Analytics: Superset timeouts for KaiOS dashboard - https://phabricator.wikimedia.org/T277320 (LGoto) p:Triage→Medium
[16:17:27] Analytics, Event-Platform, Product-Data-Infrastructure, Product-Analytics (Kanban): [MEP] [BUG] Timestamp format changed in migrated client-side EventLogging schemas - https://phabricator.wikimedia.org/T277253 (nettrom_WMF) The Product Analytics team are going to discuss this in our upcoming plan...
[16:18:05] (CR) Mholloway: "Almost there! One more question and comment inline." (3 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668244 (owner: Sharvaniharan)
[16:18:11] (CR) Mholloway: [C: -1] Image recommendations table for android [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668244 (owner: Sharvaniharan)
[16:19:59] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) Interesting, makes sense. A LooONng time ago when I did MySQL DBA work, to rest...
[16:23:13] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) Ah, if we needed to change the binlog position information used by Debezium, this...
[16:25:27] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > We are in process of simplifying things to ease our operational load. I'm inter...
[16:29:38] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Kormat) From reading through their docs a bit: * Debezium requires [[ https://debezium.io/d...
[16:31:16] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > It also grabs a global read lock on the server it connects to when making an ini...
[16:32:04] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > Debezium requires binlog_format=ROW, which means it cannot connect directly to a...
[16:34:41] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (jcrespo) @Ottomata I am going to interject here, as backups owner. I have 2 needs regarding...
[16:38:53] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) +1 @jcrespo this ticket is about solving the MW event production consistency probl...
[16:47:09] Analytics, Machine-Learning-Team: Configure the Hadoop cluster to use the GPUs available on some workers - https://phabricator.wikimedia.org/T276791 (elukey) @fkaelin I completely get your point, there is a bit of history behind the hadoop worker nodes with GPUs. They were bought when the ML team was not...
[16:47:23] Analytics, Event-Platform, Product-Analytics, Product-Data-Infrastructure, Patch-For-Review: [MEP] [BUG] dt field in migrated client-side EventLogging schemas is not set to meta.dt - https://phabricator.wikimedia.org/T277330 (mpopov) **Update**: this has been confirmed as a bug, since `dt` sh...
[16:48:01] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Joe) Ok I'll try to re-summarize my argument: the problem we're trying to solve is having tr...
[16:53:02] Analytics, Inuka-Team, Product-Analytics: Superset timeouts for KaiOS dashboard - https://phabricator.wikimedia.org/T277320 (LGoto) a:nshahquinn-wmf
[16:53:56] !log rebalance kafka partitions for webrequest_upload partition 18
[16:54:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:58:00] ottomata: for Kafka partition migrations, I see the "Kafka Broker Under Replicated Partitions" alarm is set up to warn if there's 1 under-replicated partition and won't alert unless there's 10; should I schedule downtime to silence warnings or is it ok to have it warn?
[17:03:07] Analytics-Radar, Machine-Learning-Team, SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (elukey) >>! In T275551#6913820, @fkaelin wrote: > I created a separate [[ https://docs.google.com/document/d/1Nffi3jUojC3BGNHkm2TyG7k5x30_7nzuPqgZ_tBeWNM/edit# | document ]] to discuss...
[17:03:38] * razzi afk for lunch
[17:17:50] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (Krinkle) What is the motivation for moving the `mw.config.set`...
[17:19:58] joal: when you have the chance, do you think you'd be able to merge https://gerrit.wikimedia.org/r/c/analytics/aqs/+/657228 and https://gerrit.wikimedia.org/r/c/analytics/refinery/+/668236 ? I don't have the proper permissions to do it myself, right?
[17:20:11] also, whose ops week is it
[17:20:16] lexnasser: Doing it now :)
[17:20:28] lexnasser: can you please add deploy instruction to the etherpad?
[17:20:47] lexnasser: I can help with that if you wish :)
[17:21:52] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 3 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (LGoto) a:ovasileva→nray
[17:21:52] Ah actually lexnasser, there is something we need to change first
[17:22:11] Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (Krinkle) >>! In T210106#6904086, @awight wrote: > * ResourceLo...
[17:23:54] lexnasser: I'm gonna have dinner with my family, then will be back - maybe we can recombine at that time
[17:24:11] joal: sure thing, no rush. talk later!
[17:24:11] lexnasser: my concern is with https://gerrit.wikimedia.org/r/c/analytics/aqs/+/657228/5/v1/pageviews.yaml#429
[17:25:09] basically lexnasser, this line is asking for AQS to check itself against data
[17:25:20] And this is a bit not that simple :)
[17:25:25] gone for dinner, back after
[18:11:51] Analytics, DBA, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > no analysis of the causes of such inconsistencies is provided. Hm, I guess none...
[18:14:09] joal: I'll be afk for the next hour or so - if that's too late for you, could you merge just https://gerrit.wikimedia.org/r/c/analytics/refinery/+/668236 , since it doesn't have any merge conflicts? we can totally handle the other task sometime else. also, my thought would be that that conflicting task should update the schema rather than vice versa, but I'm not set in stone on this
[18:15:42] milimetric, joal, I did the test for your suggestion and the resulting percentiles look the same, and row count is ~500 times smaller :D
[18:28:53] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (Cmjohnson) My morning got away from me and this is rescheduled for tomorrow 1400UTC (1000EST)
[18:29:32] mforns: great news, and Iceberg should get us the rest of the way there on long-term performance
[18:29:43] milimetric: sure!
[18:30:47] yes, if we didn't use iceberg, it would start to timeout at 2 years of data or so
[18:32:41] mforns: no loss in precision at all?
[18:33:15] joal: it's difficult to say, because the granularity of the data set is so low... minutely
[18:33:29] the percentiles in minutes that I tested are identical
[18:33:45] but maybe if I do enough tests, I find some that are different
[18:34:28] mforns: that's great :)
[18:35:44] joal: maybe Presto is slick enough to do the approx sampling taking the weight into account, meaning doesn't sample rows 1:1 but considers weight when it comes to choosing the sampled one..
[18:36:12] that would make sense
[18:39:55] (CR) Joal: [V: +2 C: +2] "Merging for deploy this week - WARNING: Job not to be started before AQS is ready" [analytics/refinery] - https://gerrit.wikimedia.org/r/668236 (https://phabricator.wikimedia.org/T207171) (owner: Lex Nasser)
[18:42:12] * elukey afk!
[18:44:32] Analytics, Event-Platform, Inuka-Team (Kanban): KaiOSAppFeedback Event Platform Migration - https://phabricator.wikimedia.org/T267345 (SBisson) Same PR for all 3 schemas.
[18:45:31] Analytics, Event-Platform, Inuka-Team (Kanban): InukaPageView Event Platform Migration - https://phabricator.wikimedia.org/T267344 (SBisson) a:SBisson Same PR for all 3 schemas: https://github.com/wikimedia/wikipedia-kaios/pull/348
[18:45:47] Analytics, Event-Platform, Inuka-Team (Kanban): KaiOSAppFirstRun Event Platform Migration - https://phabricator.wikimedia.org/T267346 (SBisson) a:SBisson Same PR for all 3 schemas: https://github.com/wikimedia/wikipedia-kaios/pull/348
[19:05:42] Analytics-Clusters, DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Bstorm) I went ahead and refreshed the view definitions on the host because there have been a few changes...
[19:15:07] milimetric: o/
[19:15:17] looking at the data platform doc
[19:15:30] i wonder if we shouldn't just say ETL instead of Collect Process Serve?
[19:15:40] although guess serve is a bit more different than load
[19:18:20] Heya - back I am
[19:18:47] lexnasser: I'll be here until the end of the hour - let me know if you have a minute to talk about the AQS patches
[19:19:35] I think ETL is more specific, because then people debate that it should be ELT and we're trying to be more abstract than that
[19:19:42] ottomata: ^
[19:20:32] and Desiree was saying Tajh is using that Collect Process Serve terminology too, coincidentally
[19:21:13] oh really!
[19:21:15] joal: maybe you and ottomata want to talk about the doc, I'm on kid duty till later
[19:21:17] ok let's keep it
[19:21:31] oh ok yeah i'm trying to make the problem statement very clear
[19:21:32] iterating
[19:21:43] joal: also re-read your great Design Document - Data Architecture and Mediawiki
[19:21:51] I liked all the iterations I saw so far :)
[19:22:04] i think what we are trying to describe is basically that for the long term solution, at a higher less technical level
[19:22:17] hopefully to get alignment and commitment to that long term plan
[19:22:36] ottomata, milimetric - Happy to talk :)
[19:22:48] ottomata: want to pair-write?
[19:23:03] joal ya gimme a few mins to finish my current thought
[19:23:16] ottomata: I'll be in the cave
[19:23:43] milimetric: my salutations to Ada and Atlas :)
[19:30:49] joal: I'm free for a bit to discuss AQS
[19:34:22] Ah lexnasser - I've started a talk with Andrew :)
[19:34:57] no worries, do you think you'll be free in 15 minutes? if not, no worries - we can chat tomorrow
[19:35:44] lexnasser: I'll make it happen for 15 minutes
[19:44:11] (CR) Fdans: "@Milimetric the logic is not repeated though. The case statements are different from the ones in the daily job. These are encoding the days" [analytics/refinery] - https://gerrit.wikimedia.org/r/658348 (https://phabricator.wikimedia.org/T265732) (owner: Fdans)
[19:45:01] joal, milimetric: now that the session length intermediate table will have a count field, I wonder if it's better to change the query, instead of using the row_number partition trick, just use the method that we initially imagined...
[19:45:45] sessions of length 1 = ticks1 - ticks2
[19:47:33] I guess the query would be equivalently complex
[19:48:59] probably would need to join with itself after aggregating ticks to do the subtraction
[19:49:45] although, self-join would happen with already very small data...
[19:49:51] Analytics-Radar, Cassandra, ContentTranslation, Event-Platform, and 10 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (Eevans)
[19:50:28] hmmm... I'm hesitant to change, because we'd lose the QA work that we have done??
[19:54:19] joal: free now?
[19:54:24] Yes!
[19:54:29] lexnasser: to the cave :)
[19:57:28] (PS3) Fdans: Add monthly pageview complete job [analytics/refinery] - https://gerrit.wikimedia.org/r/658348 (https://phabricator.wikimedia.org/T265732)
[20:08:06] (CR) Joal: [C: +1] "Merging this for deploy this week" (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171) (owner: Lex Nasser)
[20:08:41] (CR) Joal: [C: +1] "Actually, merging after previous inline comment is fixed (wrong top-level comment sorry)" [analytics/aqs] - https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171) (owner: Lex Nasser)
[20:09:09] ottomata: I added a note about pinging lexnasser tomorrow when we deploy his patches to AQS
[20:12:05] mforns: I'm all for keeping the query as is - the row_number OVER partition is definitely the most elegant approach IMO
[20:13:16] joal: I'm testing the other query right now, to see if the results are different
[20:14:09] joal: hm... even if the query I'm testing is smaller it has 6 m-r steps, while the "row_number query" has only 4.
[20:14:24] ack mforns - I'm interested in perf differences if there are any (I'm thinking the row_number one will be more efficient, let's triple check)
[20:14:30] \o/
[20:14:33] k
[20:16:34] ok gone for tonight - see you tomorrow team
[20:16:39] byeeee :]
[20:17:52] random question... is quarry maintained by this team?
[20:18:59] never mind... I think I answered my own question
[20:19:05] I'll have to go look at the source for my follow up questions
[20:22:34] haha :)
[20:22:46] nope, that is in Cloud VPS, which I think you found out :)
[20:23:41] tltaylor: it's sort of loosely maintained by the cloud team, it was inspired by some code we wrote to compute cross-wiki metrics in another lifetime. But we have some thoughts on the future of such access to our data
[20:24:04] I would imagine.
does it use AQS
[20:24:24] https://phabricator.wikimedia.org/T215858
[20:24:32] https://phabricator.wikimedia.org/T204950
[20:29:01] aha
[21:42:44] Analytics, Product-Analytics: Hive Runtime Error - Query on event.MobileWikiAppDailyStats failing with errors - https://phabricator.wikimedia.org/T277348 (SNowick_WMF) Also note I am getting the same stderr for a different query on table wmf.mediawiki_history: ` SELECT COUNT(1) AS n_cumulative_total,...
[21:42:48] Analytics, Better Use Of Data: Optimize intermediate session length data set and dashboard - https://phabricator.wikimedia.org/T277512 (mforns)
[21:46:43] Analytics, Better Use Of Data: Optimize intermediate session length data set and dashboard - https://phabricator.wikimedia.org/T277512 (mforns) I think solving #2 will be enough for the dashboard to perform fine for several months, maybe a couple years. I'm already working on that. It should be a small c...
[21:47:16] (PS1) Mforns: Optimize data format of session length intermediate table [analytics/refinery] - https://gerrit.wikimedia.org/r/672541 (https://phabricator.wikimedia.org/T277512)
[21:49:38] Analytics, Better Use Of Data, Patch-For-Review: Optimize intermediate session length data set and dashboard - https://phabricator.wikimedia.org/T277512 (mforns) @cchen once we merge and deploy the optimization above, the session length dashboard will need some adjustments. Maybe we can set up a meet...
[21:53:01] (CR) Mforns: [V: +2] "I tested this with real data at 10% sampling, and the results are 500 smaller in #rows than the current table. Plus, the approx_percentile" [analytics/refinery] - https://gerrit.wikimedia.org/r/672541 (https://phabricator.wikimedia.org/T277512) (owner: Mforns)
[23:34:37] Analytics-EventLogging, Analytics-Radar, Contributors-Team, MobileFrontend: Schema:MobileWebEditing: What are commons sorts of errors?
- https://phabricator.wikimedia.org/T118366 (Jdlrobson) Open→Invalid Schema is inactive so presumably this task is now invalid.
[23:55:50] Analytics, Event-Platform, Research: TranslationRecommendation* Schemas Event Platform Migration - https://phabricator.wikimedia.org/T271163 (bmansurov) Indeed. Those schema tables are all empty. I visited the eventgate-validation dashboard on Logstash, but I couldn't find any such requests. Where ca...