[00:40:15] 10Analytics, 10Patch-For-Review: Newpytyer python spark kernels - https://phabricator.wikimedia.org/T272313 (10Ottomata) PR for wmfdata: https://github.com/wikimedia/wmfdata-python/pull/22 [00:44:53] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10Ottomata) @nshahquinn-wmf are you using the ssh terminal or the Notebook Terminal? In the ssh terminal I can't reproduce: ` 00:43:04 [@stat1008:/home/otto] 1 $ source conda-ac... [00:48:23] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10Ottomata) @fkaelin Ah ah indeed it should. I think this is a pip thing. In my previous tests conda lets me install newer versions of things into my conda env. For pip, it... [00:56:44] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10Ottomata) Added: https://wikitech.wikimedia.org/wiki/User:Ottomata/Jupyter#pip_fails_to_install_a_newer_version_of_a_package [00:59:11] PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:43:35] 10Analytics-Radar, 10Data-Persistence (Consulted), 10Platform Engineering Roadmap Decision Making, 10Epic, and 3 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (10Marostegui) Correct, this is how the table looks like now: ` root@cumin1001:/home/maro... [07:01:16] good morning! [07:01:59] Since I see sqoop MR jobs in yarn, I think that worker reimages (so in place upgrade stretch -> buster) can be stopped until mw history is done [07:02:06] in the meantime, I'll focus on the new worker ndoes [07:02:07] *nodes [07:10:39] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1120.eqiad.wmnet', 'an-worker1121.eqi... [07:35:32] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1120.eqiad.wmnet', 'an-worker1121.eqiad.wmnet'] ` and were **ALL** successful. [07:37:48] 10Analytics: Turnilo split thresholds too low - https://phabricator.wikimedia.org/T276192 (10Gilles) [07:38:57] 10Analytics: Turnilo split thresholds too low - https://phabricator.wikimedia.org/T276192 (10Gilles) [07:40:10] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1122.eqiad.wmnet', 'an-worker1123.eqi... 
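For the "pip fails to install a newer version of a package" issue documented above, a common workaround in stacked conda environments is to force pip to ignore the copy already provided by the read-only base environment. A minimal sketch, assuming a user conda env is already activated (as in the truncated `source conda-ac...` command); the actual fix recommended on the linked wikitech page may differ:

```
# With a user conda env activated, force pip to install a newer version even
# though the base environment already provides an older copy of the package.
pip install --upgrade --ignore-installed pandas   # 'pandas' is only an example package
pip show pandas | grep -i '^Version'              # confirm the env-local copy is picked up
```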
[07:50:59] (03CR) 10Awight: "recheck" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/666933 (https://phabricator.wikimedia.org/T273454) (owner: 10Awight) [07:51:10] (03CR) 10Awight: "recheck" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/667565 (https://phabricator.wikimedia.org/T272569) (owner: 10Awight) [08:01:28] !log manual start of performance-asotranking on stat1007 (requested by Gilles) - T276121 [08:01:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:01:31] T276121: asoranking timer failed on stat1007 - https://phabricator.wikimedia.org/T276121 [08:02:14] * elukey bbiab [08:06:20] RECOVERY - Check the last execution of performance-asoranking on stat1007 is OK: OK: Status of the systemd unit performance-asoranking https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:07:13] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1122.eqiad.wmnet', 'an-worker1123.eqiad.wmnet', 'an-worker1124.eqiad.wmnet'] ` and wer... [08:09:43] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1125.eqiad.wmnet', 'an-worker1126.eqi... [08:36:34] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1125.eqiad.wmnet', 'an-worker1126.eqiad.wmnet', 'an-worker1127.eqiad.wmnet'] ` and wer... [08:37:51] Good morning - Thanks elukey for letting sqoop finish gently :0 [08:37:54] :) [08:38:21] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1128.eqiad.wmnet', 'an-worker1130.eqi... [08:38:30] PROBLEM - Check the last execution of performance-asoranking on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit performance-asoranking https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:38:43] joal: bonjour! I am close to have 10 new workers ready to be added to the cluster :) [08:39:02] The growing list of stuff to get with sqoop doesn't make the job faster obviously - Splitting the main job in smaller ones would help [08:39:07] elukey: \o/ [08:39:43] I made a mistake in the racking task for the last 6 nodes, now we have a rack (A4) with 8 nodes, so I am trying to ask to dcops to move them around if possible [08:40:11] but it should be solvable in few days hopefully [08:40:22] Arf elukey - While not optimal it's alos a non-blocker IMO - Let's do as ou prefer :) [08:40:52] joal: yeah I know, but now that the hosts are not running anything etc.. it is easier to move them if needed :( [08:40:59] and 8 nodes down at the same time it is not great [08:41:02] of course I get that [08:41:05] (rack power failure etc..) 
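On the rack-failure worry above (many new workers now sharing rack A4), a quick way to see how DataNodes and block replicas are spread across racks before taking any of them down deliberately is standard HDFS tooling. A hedged sketch; running as the `hdfs` user through `kerberos-run-command` is an assumption about the WMF setup:

```
# Hosts per rack as the NameNode currently sees them.
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology

# Replication health, including rack placement of block replicas and any
# under-replicated blocks that a whole-rack outage would make worse.
sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -racks | tail -n 30
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -report | grep -i 'under replicated'
```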
[08:41:06] isgh [08:41:08] *sigh [08:41:29] but we are getting more and more racks filled up, I am a little worried for the long term reliability [08:41:42] maybe we could try to do some chaos monkey tests [08:41:46] stopping datanodes [08:41:53] in a controlled way I mean [08:42:04] 2 4 6 8 10 [08:42:17] to simulate even a row failure [08:42:33] elukey: why not :) [08:42:59] elukey: the usual question comes next: what do we don't do if we do that :D [08:43:28] joal: ??? :D [08:43:37] ahhhh okok [08:43:41] sorry now i get it [08:44:05] As in, what else do we have that will be pushed if we do that one [08:44:05] I probably need a coffee [08:44:07] :) [08:44:10] I have some [08:44:19] * joal shares some virtual coffee with elukey [08:45:08] :) [08:45:26] joal: so today's specials from the ops menu are [08:45:39] 1) apply cache settings to druid public [08:45:50] 2) ad 10/12 new buster nodes to the hadoop cluster [08:46:03] lemme know if you have concerns [08:47:27] elukey: no concern :) [08:47:59] elukey: I'll have a bottle of this nice gobbl-wine with the special please :) [08:49:22] ahhaha ack! [08:49:39] the druid metrics are really nice afaics (for the analytics cluster) [08:51:23] indeed elukey [08:51:35] * joal is eager to look at those metrics for the public cluster :) [08:55:49] * elukey too [09:05:24] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1128.eqiad.wmnet', 'an-worker1130.eqiad.wmnet', 'an-worker1131.eqiad.wmnet'] ` and wer... [09:30:21] joal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/666598 when you have a moment [09:30:44] the plan that I have in mind for public is [09:30:51] 1) deploy the puppet change on all nodes [09:31:08] 2) roll restart the brokers (so they'll start using only query cache) [09:31:22] 3) roll restart historicals, one a the time, slowly, to enable segment query caching [09:55:33] 10Analytics-Clusters, 10SRE, 10vm-requests: Eq: new Druid test VM for analytics - https://phabricator.wikimedia.org/T266771 (10akosiaris) 05Open→03Resolved This seem to be done by the move in the workboard from Backlog to Done. Feel free to reopen though! [09:55:59] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1119.eqiad.wmnet'] ` The log can be f... [10:03:22] sorry elukey, I was in a meeting [10:04:21] np! Sorry for the extra pings :( [10:04:54] all good for me elukey - let's change that public cluster :) [10:14:13] !log roll restart druid brokers on druid public to pick up new cache settings (no segment caching, only query caching) [10:14:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:18:32] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1119.eqiad.wmnet'] ` and were **ALL** successful. 
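The three-step druid-public plan above (puppet change everywhere, then brokers, then historicals one at a time) can be sanity-checked per host before moving on to the next one. A hedged sketch only: the config paths, systemd unit names and default Druid ports (8082 broker, 8083 historical) are assumptions about the WMF packaging:

```
# Confirm puppet rendered the new cache settings on this druid-public host.
grep -E 'druid\..*cache' /etc/druid/broker/runtime.properties /etc/druid/historical/runtime.properties

# Restart one daemon at a time and wait for it to report healthy before the next host.
sudo systemctl restart druid-broker
curl -sf http://localhost:8082/status/health && echo 'broker healthy'        # port assumed
sudo systemctl restart druid-historical
curl -sf http://localhost:8083/status/health && echo 'historical healthy'    # port assumed
```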
[10:21:16] !log roll restart druid historicals on druid public to pick up new cache settings (enable segment caching) [10:21:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:22:39] 10Analytics: Turnilo split thresholds too low - https://phabricator.wikimedia.org/T276192 (10JAllemandou) I have found this issue: https://github.com/allegro/turnilo/issues/472 From my tests I can get up to 100 values, but it depends on the dimensions by which the cube is split, and the chart format. I have man... [10:32:07] 10Analytics: Turnilo split thresholds too low - https://phabricator.wikimedia.org/T276192 (10Gilles) 05Open→03Invalid Ah, yes, 100 is hardcoded, so I guess we'll see 100 countries at least. Thanks for that link, it let me find the drop-down menu that I didn't know existed to override the default split limit... [10:38:04] ok druid public restarted, but I don't see cache metrics for the broekrs [10:38:09] *brokers [10:40:42] elukey: query response-time is getting higher (as expected with no cache) [10:42:16] joal: yep seems so, wikistats looks very fast though, I am a little confused [10:42:30] rechecked the settings and I should have enabled the right tunables [10:43:17] also the historicals cache are warming up, let's wait a bit for latencies, they should improve [10:43:40] ack elukey - I thought historical were not yet restarted [10:43:45] my bad [10:44:37] joal: interesting, if you check the caffeine values after the restarts are not zero, there is a low volume of entries/sec [10:44:46] for the brokers I mean [10:45:16] latencies are improving a lot [10:46:38] elukey: batcave for a minute? [10:46:48] joal: sure [11:12:42] RECOVERY - Check the last execution of performance-asoranking on stat1007 is OK: OK: Status of the systemd unit performance-asoranking https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:15:14] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10nshahquinn-wmf) >>! In T224658#6873259, @Ottomata wrote: > @nshahquinn-wmf are you using the ssh terminal or the Notebook Terminal? This was in the notebook terminal. I just tr... [11:16:40] elukey: I don't see cached object for brokers on public cluster - could we have messed up config? [11:19:15] on the hosts the settings are [11:19:16] druid.broker.cache.populateCache=false [11:19:16] druid.broker.cache.populateResultLevelCache=true [11:19:16] druid.broker.cache.useCache=false [11:19:16] druid.broker.cache.useResultLevelCache=true [11:19:26] that in theory is what we want [11:19:38] but it is weird indeed [11:20:04] elukey: could it be the metric not being correctly updated to a weird status of metriv-updater? [11:20:47] joal: I am reading https://druid.apache.org/docs/latest/querying/caching.html#query-caching-on-brokers, maybe we are missing some parameter [11:20:56] Whole-query result level caching is controlled by the parameters useResultLevelCache and populateResultLevelCache and runtime properties druid.broker.cache.*. [11:21:15] this is what confused me about druid every time, 100 options [11:21:20] elukey: maybe we're missing a setting that the analytics-cluster has? [11:21:29] like cache-size? 
but that feels weird [11:22:11] https://druid.apache.org/docs/latest/configuration/index.html#broker-caching is very generic [11:22:20] doesn't discriminate between segment/query cache [11:22:47] joal: checked and there is druid.cache.sizeInBytes=2147483648 [11:22:53] ack [11:24:30] the only thing that I see is "Segment-level caching is controlled by the parameters useCache and populateCache." [11:30:37] :( [11:34:07] joal: let's see after lunch, there may be some follow up to do, but metrics looks good [11:34:22] for sure elukey it looks all good :) [11:34:31] even if query time broker -> historical went up a bit, the query time for historicals went down and stabilized as well [11:34:41] (as the cache grows [11:34:51] all right ttl :) [11:34:54] * elukey lunch! [11:35:17] * joal wonders how long of a TTL elukey is set for [11:41:47] time to lunch? [12:35:04] * klausman lunch and router PSU replacement [13:28:57] (03CR) 10Joal: Update UA-Parser to 1.5.2 (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/667717 (https://phabricator.wikimedia.org/T272926) (owner: 10Milimetric) [13:30:29] 10Analytics, 10Machine-Learning-Team, 10SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10elukey) [13:32:28] 10Analytics, 10Machine-Learning-Team, 10SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10elukey) @fkaelin we discussed this during our grooming session and we decided to pause the efforts for Kubeflow until we'll know that this is the technology/stack that we'll use. We'll know f... [13:33:42] 10Analytics, 10Machine-Learning-Team, 10SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10elukey) p:05Medium→03Triage [13:35:38] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10elukey) Pausing this for a few days to let the MW history jobs to complete :) [13:38:06] 10Analytics, 10Machine-Learning-Team, 10SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10Joseph) Hi @fkaelin, I think you tagged wrong Joseph. [13:42:20] !log Add an-worker11[19,20-28,30,31] to Analytics Hadoop [13:42:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:50:37] will run puppet on 1119 first, then to the others, in small batches [13:50:52] I want to make sure that puppet runs fine, users are deployed correctly, etv.. [13:50:59] * joal is hitting f5 on the yarn scheduler interface [13:51:28] /usr/lib/hadoop-hdfs/bin/hdfs: line 319: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java: No such file or directory [13:51:35] /o\ [13:51:46] I think we are missing a dependency in puppet :D [13:51:51] :) [13:54:24] ok 1119 added [13:55:50] proceeding with 1120 [13:57:46] users looks good, all uids are as expected [13:59:30] 10Analytics, 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech-focus: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10fgiunchedi) random-ish update re: checkpoint storage after a chat with @Zbyszko: the current situation... 
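For reference, the broker-side caching debugged above comes down to a handful of runtime properties. This is what one would expect to see on a druid-public broker after today's change, assembled from the values pasted and checked in the conversation; the file path is an assumption, and the caffeine cache type is inferred from the metrics being discussed rather than confirmed here:

```
# Expected broker cache settings: result-level cache on, segment-level cache off.
grep -E '^druid\.(broker\.)?cache' /etc/druid/broker/runtime.properties
# druid.broker.cache.useCache=false
# druid.broker.cache.populateCache=false
# druid.broker.cache.useResultLevelCache=true
# druid.broker.cache.populateResultLevelCache=true
# druid.cache.sizeInBytes=2147483648   # 2 GiB, as checked above
# druid.cache.type=caffeine            # assumption, implied by the caffeine metrics
```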
[14:15:48] joal: something interesting from druid - the caffeine broker metric is hit+misses, and it looks really matching the correspondent cache misses (generic, non caffeine metric) of misses [14:22:50] in other news, we just crossed 3PBs :) [14:22:59] hehehe :) [14:24:12] 73 worker nodes [14:24:45] 9,45Tb RAM :) [14:24:58] Almost 4k cores :) [14:31:19] wow :) [14:46:22] TIL: link-reccomendation in kubernetes fetches data from thorium (https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/) [14:46:36] *recommendation [14:57:26] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Data-Infrastructure, and 2 others: MEP Client MediaWiki PHP - https://phabricator.wikimedia.org/T253121 (10Mholloway) [14:57:55] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Data-Infrastructure, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (10Mholloway) 05Open→03Resolved a:03Mholloway [14:58:35] Hi mforns - Let me know when you wish we sync on ops stuff :) [15:12:51] hi teamm! [15:13:14] joal: yes, we didn't do it last week, whenever you want! [15:13:24] mforns: now? [15:13:37] mforns: maybe after ou read emails :) [15:13:37] joal: ok! omw [15:13:53] joal: no no, it's fine :] [15:13:59] ack - to the cave [15:18:41] 10Analytics, 10Event-Platform, 10Product-Analytics, 10Product-Data-Infrastructure: [MEP] [BUG] Timestamp format changed in migrated server-side EventLogging schemas - https://phabricator.wikimedia.org/T276235 (10mpopov) [15:29:41] 10Analytics, 10Patch-For-Review: Dropping data from druid takes down aqs hosts - part 2 - https://phabricator.wikimedia.org/T270173 (10elukey) [15:34:52] I hate computers. [15:35:08] klausman: maybe the opposite is true as well? [15:35:27] This is quite possible. [15:35:44] 10Analytics, 10Event-Platform, 10Product-Analytics, 10Product-Data-Infrastructure: [MEP] [BUG] Timestamp format changed in migrated server-side EventLogging schemas - https://phabricator.wikimedia.org/T276235 (10Ottomata) Oh that is a problem, I saw your original message Morten and thought ok so timezone i... [15:35:56] It would definitely explain the cuts on my hands and the sanity loss of something as simple as rplacing a PSU. [15:36:24] Datacenter Ops people at Google used to have t-shirts that said "My other computer is made of razorblades and hate." [15:39:46] lol [15:39:51] 10Analytics, 10Patch-For-Review: Newpytyer python spark kernels - https://phabricator.wikimedia.org/T272313 (10Isaac) Just wanted to chime in and say thanks for making this PR @Ottomata -- I probably won't getting around to testing it in the next week or two. I don't want to hold you up though because I know y... [15:56:26] 10Analytics, 10Event-Platform, 10Product-Analytics, 10Product-Data-Infrastructure: [MEP] [BUG] Timestamp format changed in migrated server-side EventLogging schemas - https://phabricator.wikimedia.org/T276235 (10mpopov) >>! In T276235#6875046, @Ottomata wrote: > Oh that is a problem, I saw your original me... [16:08:28] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) 05Open→03Resolved Thanks a lot, will follow up in a new task! [16:18:39] ottomata: do you have time? 
[16:19:17] 10Analytics, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10elukey) [16:19:36] 10Analytics, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10elukey) [16:19:58] 10Analytics, 10Patch-For-Review: Add 6 worker nodes to the HDFS Namenode config of the Analytics Hadoop cluster - https://phabricator.wikimedia.org/T275767 (10elukey) 05Open→03Stalled Blocked until T276239 is solved [16:20:00] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10elukey) [16:20:47] 10Analytics-Clusters: Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster - https://phabricator.wikimedia.org/T274795 (10elukey) [16:22:41] (03PS2) 10Joal: [WIP] Fix wikitext history job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657053 (https://phabricator.wikimedia.org/T269032) [16:26:00] 10Analytics, 10Patch-For-Review: Dropping data from druid takes down aqs hosts - part 2 - https://phabricator.wikimedia.org/T270173 (10elukey) We have deployed a new cache config in both clusters, with the hope that the new scheme will help when dropping data. High level details: * segment caching disabled o... [16:26:21] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Dropping data from druid takes down aqs hosts - part 2 - https://phabricator.wikimedia.org/T270173 (10elukey) [16:26:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Dropping data from druid takes down aqs hosts - part 2 - https://phabricator.wikimedia.org/T270173 (10elukey) a:03elukey [16:27:17] * elukey bbiab [16:49:12] 10Analytics, 10Patch-For-Review: Newpytyer python spark kernels - https://phabricator.wikimedia.org/T272313 (10fkaelin) I second Isaac`s comment. I reviewed the gh PR and tested successfully. [16:54:16] 10Analytics, 10Product-Infrastructure-Team-Backlog, 10Wikimedia Taiwan, 10Chinese-Sites, 10Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (10ssingh) Hi @JAllemandou: thanks, I think proceeding through Wikimedia Taiwan as @Shizha... [16:56:56] 10Analytics: OutOfMemory error when querying mediawiki_wikitext_history - https://phabricator.wikimedia.org/T231373 (10awight) [16:59:55] 10Analytics: SLF4J logspam when using hadoop command-line clients - https://phabricator.wikimedia.org/T276240 (10awight) [17:02:32] fdans: standuyP! [17:07:12] 10Analytics, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: WDCM_Sqoop_Clients.R fails from stat1004 - https://phabricator.wikimedia.org/T274866 (10elukey) @GoranSMilovanovic I just applied the last patch for sqoop, it shouldn't change anything from your side, but it is what up... 
[17:26:31] !log rebalance kafka partitions for webrequest_upload partition 7 [17:26:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:28:51] 10Analytics-EventLogging, 10Analytics-Radar, 10Front-end-Standards-Group, 10MediaWiki-extensions-WikimediaEvents, and 2 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (10Jdlrobson) [17:35:55] 10Analytics-Radar, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (10MMiller_WMF) [17:36:07] 10Analytics-Radar, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (10MMiller_WMF) [17:36:09] 10Analytics-Radar, 10Growth-Scaling, 10Product-Analytics, 10Growth-Team (Current Sprint): Growth: shorten welcome survey retention to 90 days - https://phabricator.wikimedia.org/T275171 (10MMiller_WMF) [17:36:24] 10Analytics-Radar, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (10MMiller_WMF) [17:36:26] 10Analytics-Radar, 10Growth-Scaling, 10Product-Analytics, 10Growth-Team (Current Sprint): Growth: shorten welcome survey retention to 90 days - https://phabricator.wikimedia.org/T275171 (10MMiller_WMF) [17:47:27] 10Analytics, 10FR-Tech-Analytics, 10Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (10mpopov) Just closed that task to remove the instrumentation and updated the migration status of the WikipediaPortal schema in... [17:48:48] (03CR) 10Ebernhardson: "any chance to get this through sooner than later? We haven't been able to run any processes to prune data for privacy policy reasons since" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667274 (owner: 10Ebernhardson) [17:52:57] ping mforns on this one --^ we could be deploying this today maybe? [17:53:22] sure joal :] [18:00:19] (03PS1) 10Gerrit maintenance bot: Add trv.wikipedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667905 (https://phabricator.wikimedia.org/T276246) [18:08:16] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for next deploy." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667905 (https://phabricator.wikimedia.org/T276246) (owner: 10Gerrit maintenance bot) [18:13:07] 10Analytics, 10Product-Analytics: Can't re-run failed Oozie workflows in Hue/Hue-Next (as non-admin) - https://phabricator.wikimedia.org/T275212 (10nshahquinn-wmf) Adding the ACL made it possible for me to manage the job from the command line, but I still can't manage them from the Hue interface. [18:13:57] 10Analytics, 10Product-Analytics (Kanban): Big increase in traffic for projects except 'wikipedia' family since Feb 14th - https://phabricator.wikimedia.org/T274823 (10LGoto) [18:14:31] 10Analytics, 10Product-Analytics (Kanban): Big increase in traffic for projects except 'wikipedia' family since Feb 14th - https://phabricator.wikimedia.org/T274823 (10LGoto) p:05Triage→03Medium [18:16:19] 10Analytics-Radar, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics (Kanban): Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (10LGoto) [18:27:48] joal mforns: chat about uaparser? [18:27:54] sure milimetric [18:28:00] milimetric: sure, bc? 
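On the Hue limitation discussed above, the Oozie CLI remains usable once the ACL is in place. A hedged sketch of the equivalent command-line workflow management; the Oozie server URL and job IDs are placeholders:

```
# Inspect a coordinator/workflow and re-run a failed action from a client host.
export OOZIE_URL=http://oozie.example:11000/oozie                   # placeholder; usually preset
oozie job -info  0001234-210101000000000-oozie-oozi-C               # show coordinator status
oozie job -rerun 0001234-210101000000000-oozie-oozi-C -action 42    # re-run one failed action
oozie job -kill  0001234-210101000000000-oozie-oozi-W               # or kill a stuck workflow
```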
[18:28:03] yep [18:31:00] 10Analytics-Clusters, 10Analytics-Kanban: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (10razzi) [18:32:46] 10Analytics, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10wiki_willy) a:03Cmjohnson [18:35:44] razzi: https://github.com/apache/superset/issues/13396 - awesome report :) [18:35:57] Thanks elukey! :) [18:36:34] razzi: I am wondering one thing - should we push more for https://phabricator.wikimedia.org/T263972 ? [18:37:01] the title is wrong now that I think about it, it has been removed [18:37:09] but the charts haven't been converted [18:38:14] maybe there is a way via mysql to move the charts on the db, without requiring users to do so [18:38:23] or to come up with a list of users so we can contact them [18:38:50] elukey: yeah, that seems like a good thing to work on [18:40:06] maybe after superset 1.1 is out, I wish there was a way to ask user to do so [18:43:08] 10Analytics, 10Product-Analytics (Kanban): Big increase in traffic for projects except 'wikipedia' family since Feb 14th - https://phabricator.wikimedia.org/T274823 (10kzimmerman) @cchen can you summarize the findings from you & @JAllemandou here, for future reference? My understanding is that you didn't find... [18:43:28] elukey: how do you feel about going over my backlog tomorrow? I'd say today but I want to organize it first :) [18:44:36] razzi: yep anytime! Send me a meeting invite (we can do even 1h with a breaks so I'll stop talking periodically :D) [18:50:52] ottomata: I need a +2 on this to get it on the train, but it's not life or death, if you're around: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/667717 [18:51:04] ottomata: two things to keep in mind [18:51:18] (Joseph said he's ok with your decision here) [18:52:32] 1. for now we're not updating the python version, we think that's ok because you plan to deprecate the eventlogging processor soon. If you disagree, I'll update that to point to the same uap-core [18:53:05] (03CR) 10Ottomata: [C: 03+2] Update UA-Parser to 1.5.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/667717 (https://phabricator.wikimedia.org/T272926) (owner: 10Milimetric) [18:53:19] 2. (doesn't matter 'cause you just merged it, hahaha) [18:53:25] did i imerge? [18:53:33] you told me you wanted +2 [18:53:33] ? [18:53:34] it's source so it'll merge :) [18:53:38] oh [18:53:43] that's fine! 
[18:53:44] if you agree [18:53:46] milimetric: maybe we should check with performance team / gilles about python ua parser [18:54:18] I'm sending the uap-python project a PR to bump the core version anyway, I might as well update that part too, will do that [18:54:27] milimetric: [18:54:27] https://github.com/wikimedia/puppet/blob/production/modules/webperf/manifests/navtiming.pp#L27 [18:54:34] ah, good to know, thx [18:54:41] no reason to bother them, I'll just update [18:55:01] k [18:58:00] ottomata: while reading https://github.com/criteo/tf-yarn, I found https://github.com/criteo/tf-yarn#configuring-the-python-interpreter-and-packages [18:58:27] that references https://github.com/criteo/cluster-pack [18:58:48] (03Merged) 10jenkins-bot: Update UA-Parser to 1.5.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/667717 (https://phabricator.wikimedia.org/T272926) (owner: 10Milimetric) [18:58:49] not sure if you already went through it in the past, might be a useful reading [18:58:50] oh interesting! [18:58:53] no i haven't seen that [18:58:56] ah nice :) [18:59:03] the code i just did for wmfdata uses conda-pack, which it looks like this uses [18:59:08] reading.. (after meetingg...) [18:59:45] exactly yes I thought you'd have been interested :) [19:03:03] * elukey afk! have a good rest of the day folks :) [19:14:15] * razzi afk for lunch [19:38:41] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667274 (owner: 10Ebernhardson) [19:41:56] joal or milimetric: can you please review this https://gerrit.wikimedia.org/r/c/analytics/refinery/+/664885 ? If so, I'll deploy it today as well [19:42:36] reading mforns [19:42:46] thank you! :] [19:44:19] also joal I've seen all your comments in https://gerrit.wikimedia.org/r/c/analytics/refinery/+/654924 have been taken care no? Should I deploy? [19:45:42] mforns: it can go - I wanted to review the HQL with the errors, but this no big deal [19:46:09] ok! can you +2? [19:46:13] sure [19:46:25] thx :) [19:47:00] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for next deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654924 (https://phabricator.wikimedia.org/T207171) (owner: 10Lex Nasser) [19:47:15] Done mforns [19:47:24] also, milimetric: is this something to deploy? https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/663300 [19:47:26] mforns: Let's not start the job though - AQS is not ready [19:47:30] thanks a lot joal [19:47:36] yep, makes sense [19:48:02] mforns: I didn't have time to write that up on the train pad, I'll just deploy manually later, no worries [19:48:30] milimetric: are you sure, I haven't started yet! [19:54:35] mforns: yeah, no worries, I got my 1/1 now [19:54:43] and there's other stuff they're waiting on anyway [19:56:09] ok, leaving it there, thanks! [20:04:06] oh, but there's nothing on the train etherpad :) [20:04:15] I'll review that change now, so you have something [20:05:52] (03CR) 10Joal: "A bunch of comments" (036 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) (owner: 10Mforns) [20:06:10] mforns: I have some comments on our patch - let's wait for next week if ok for ou?> [20:07:47] joal: sure I was going to say that! 
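The conda-pack approach referenced above (used by cluster-pack, and the same idea as the wmfdata change) boils down to shipping a packed environment as a YARN archive. A minimal sketch, assuming a local conda env named `my_env` and Spark 2 on YARN; the wmfdata code itself may wire this up differently:

```
# Pack the local conda environment into a relocatable tarball.
conda pack -n my_env -o my_env.tar.gz

# Ship it with the job and point the executors' Python at the unpacked env.
PYSPARK_PYTHON=./env/bin/python \
spark2-submit --master yarn \
    --archives my_env.tar.gz#env \
    my_job.py
```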
[20:08:10] joal: maybe I'll address your comments this week and do an extra refinery-only deployment for that [20:08:17] but not now [20:08:59] 10Analytics, 10SRE: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10Ottomata) p:05Triage→03Low [20:09:23] mforns: then there's nothing to deploy this week, I'll do a manual deploy of refinery-source with my change /restart webrequest later [20:09:45] milimetric: there are other patches [20:10:05] milimetric: see train etherpad: https://etherpad.wikimedia.org/p/analytics-weekly-train [20:10:23] (03CR) 10Milimetric: [C: 03+2] "small enough, self-merging" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663300 (owner: 10Milimetric) [20:10:50] ok, I'll add that :] [20:10:57] ah! mforns it was just mislabeled :) [20:11:02] 02-02 [20:11:05] I was confused [20:11:15] ah! [20:11:26] oh, sorry [20:12:12] milimetric: that change, does it need jar version bump ups? if so, where? [20:12:38] mforns: yes, I'll submit and merge the bumps and list them on the etherpad I guess, and if you don't get to do the restarts I'll do them later [20:13:16] milimetric: if you want, I can do them, I was planning to do the ones for the uaParser change [20:14:03] I got it mforns, no worries, doing it now [20:14:13] which BTW, should be webrequest-load, Refine, and learning ones no? [20:15:01] uh... you're ahead of me, but only webrequest-load uses the UDF, I was searching source now [20:15:24] oh! ofc, we checked that actor was not using it... my bad [20:15:41] isn't Refine using it? for EventPlatform? [20:17:10] mforns: hm... it looks like refine relies on the EventLogging python-generated UA map?! [20:17:22] oh! [20:17:24] (03Merged) 10jenkins-bot: Make null result same shape as normal result [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/663300 (owner: 10Milimetric) [20:17:46] https://github.com/wikimedia/analytics-refinery-source/blob/cdaa08a990e37d9136b30d401ee76627ba431156/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/TransformFunctions.scala#L389 [20:18:23] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Newpyter - SWAP Juypter Rewrite - https://phabricator.wikimedia.org/T224658 (10Ottomata) > This was in the notebook terminal. I just tried in a regular SSH terminal, and I don't get these errors there. Ah ha, I see why. Thanks this is def a bug in the co... [20:18:44] milimetric: that is for legacy integration only [20:19:01] ah, so ottomata where does refine parse UA strings? [20:19:08] milimetric: https://github.com/wikimedia/analytics-refinery-source/blob/cdaa08a990e37d9136b30d401ee76627ba431156/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/TransformFunctions.scala#L269 [20:19:33] https://github.com/wikimedia/analytics-refinery-source/blob/cdaa08a990e37d9136b30d401ee76627ba431156/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/TransformFunctions.scala#L301-L305 [20:19:34] yup [20:20:18] ah, GetUAPropertiesUDF [20:20:18] for legacy eventlogging tables, refine is filling in the old useragent struct field with the values from user_agent_map, which are parsed from the ua string [20:20:40] bit confusing we have two UDFs doing roughly the same thing [20:21:44] so then, yeah, mforns, just two things: webrequest-load and refine [20:21:57] milimetric: ok! 
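Both UDFs named above wrap the same UA-Parser library. A hedged Hive sketch of using the non-deprecated one from the command line; the jar path is a placeholder for whatever the current refinery deployment provides, and the table/partition values are illustrative:

```
beeline -e "
  ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;  -- placeholder path
  CREATE TEMPORARY FUNCTION get_ua_properties
    AS 'org.wikimedia.analytics.refinery.hive.GetUAPropertiesUDF';
  SELECT get_ua_properties(user_agent)['browser_family'] AS browser, COUNT(*) AS n
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2021 AND month = 3 AND day = 1 AND hour = 0
  GROUP BY get_ua_properties(user_agent)['browser_family']
  ORDER BY n DESC LIMIT 10;"
```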
[20:24:29] 10Analytics, 10Product-Infrastructure-Team-Backlog, 10Wikimedia Taiwan, 10Chinese-Sites, 10Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (10JAllemandou) @ssingh : The Wikimedia Taiwan group on Phab seems very inactive (few acti... [20:24:30] is this how we update refinery-source version for refine? https://github.com/wikimedia/puppet/blob/7c025f42ac2d792bcd0dac524cb46bf1fe43faed/modules/profile/manifests/analytics/refinery/job/refine.pp#L43 [20:24:53] milimetric: yup [20:24:58] milimetric two UDFs? [20:25:31] milimetric: you can also bump up the version for the test cluster [20:25:39] there's UAParserUDF and GetUAPropertiesUDF, they do slightly different things for different reasons, but it'd be nice to be able to find both of them so renaming them to some common root [20:28:04] I noticed a few executors being terminated due to memory limits. The jobs are not memory intensive and were running without issues before the hadoop update. Did anybody see this too? I didn't investigte much, the jobs still succeed in the end. [20:28:20] ```ExecutorLostFailure (executor 36 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714. ``` [20:28:31] milimetric: ah ok [20:29:27] milimetric:  [20:29:49] I think they do exactly the same thing, UAParserUDF is just the old deprecated name (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/UAParserUDF.java#L27) [20:30:21] fkaelin: I have not experienced this particular type of errors since we have upgraded [20:30:50] (03PS1) 10Milimetric: webrequest/load: bump refinery version to 0.1.2 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667931 [20:31:18] fkaelin: we have our first run of mediawiki-history (big spark job) later those days, and I manually run a big wikitext conversion job - I'll tell you if I hit similar problem [20:31:26] (03CR) 10Milimetric: [V: 03+2 C: 03+2] webrequest/load: bump refinery version to 0.1.2 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667931 (owner: 10Milimetric) [20:32:20] mforns: do these need it too: https://github.com/wikimedia/puppet/blob/e85e3f165814c509d0a30f0c97207fa30c43ec8a/modules/profile/manifests/analytics/refinery/job/test/druid_load.pp ? [20:33:03] milimetric: I don't think druid loading jobs are looking at useragent parser no? [20:33:18] I meant using [20:33:35] I donno, ok, I'll leave it alone [20:33:47] I think you're right mforns [20:34:42] k, so we just need our SRE overlords to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/667930 [20:35:36] wait, not yet!~ [20:35:59] I have to deploy both source and refinery [20:36:30] in any case, thanks a lot for the help milimetric :D [20:36:35] (03PS3) 10Joal: Fix wikitext history job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657053 (https://phabricator.wikimedia.org/T269032) [20:36:54] mforns: if your deploy has not yet started --^ [20:36:55] milimetric: is there any bump up needed for https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/663300/ [20:37:06] joal: looking! 
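For the `ExecutorLostFailure ... exceeding memory limits` message quoted above, the usual mitigation is exactly what the error text suggests: give each executor more off-heap headroom, so the YARN container limit is not just the heap plus a too-small default overhead. A hedged sketch for Spark 2 on YARN; the sizes are illustrative only:

```
# 8g heap + 2g overhead asks YARN for ~10g containers, above the 9g limit that
# was killing executors; the overhead value is in MB here.
spark2-submit --master yarn \
    --executor-memory 8g \
    --conf spark.yarn.executor.memoryOverhead=2048 \
    my_job.py
# Newer Spark releases spell the setting spark.executor.memoryOverhead instead.
```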
[20:39:32] (03PS1) 10Joal: Bump jar version of wikitext oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667932 (https://phabricator.wikimedia.org/T269032) [20:39:44] mforns: if the above is correct, ou have the following :) --^ [20:40:40] joal: the code makes sense to me, has it been tested? [20:41:03] (03PS1) 10Milimetric: mediacounts and mediarequests: bump refinery source to 0.1.2 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667933 [20:41:05] mforns: correct sit - successful run on a previsouly failed instance this afternoon [20:41:15] s/sit/sir/ [20:41:16] cool! will merge [20:41:17] (03CR) 10Milimetric: [V: 03+2 C: 03+2] mediacounts and mediarequests: bump refinery source to 0.1.2 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667933 (owner: 10Milimetric) [20:41:28] mforns: etherpad updated [20:41:32] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657053 (https://phabricator.wikimedia.org/T269032) (owner: 10Joal) [20:42:02] thanks a lot for the help milimetric and joal :DDDD [20:42:24] oof, it's a lot to deploy sorry [20:42:29] yeah :( [20:43:28] thanks a lot mforns for deploying all that! [20:43:35] milimetric, joal: I think the deployment plan looks ready now, agree? [20:43:37] please ping if you wish help :) [20:43:47] thanks :] [20:43:52] yes, looks good to me [20:43:56] Looks good to me mforns - I have read lines and they make sense [20:44:06] k, thx! [20:45:41] (03Merged) 10jenkins-bot: Fix wikitext history job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657053 (https://phabricator.wikimedia.org/T269032) (owner: 10Joal) [20:48:52] (03PS1) 10Mforns: Update changelog.md for v0.1.2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/667936 [20:49:28] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Self-merging for deployment train." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/667936 (owner: 10Mforns) [20:59:15] 10Analytics-Radar, 10Growth-Scaling, 10Growth-Team (Current Sprint), 10Product-Analytics (Kanban): Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (10nettrom_WMF) [21:39:26] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.1.2 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667942 [21:40:45] Heya mforns :) from that patch it seems you've forgotten to log :) [21:41:13] mforns: I'm planning to stop working - anything I cou;d help with before logging off? 
[21:41:31] joal: for refinery-source I always only log at the end, given there's nothing that can go wrong [21:41:36] "nothing" [21:41:40] nothing that can break stuff [21:41:46] :] [21:41:49] ack mforns - my bad then :) [21:42:05] no joal, I've had some internet interruptions these last 10 minutes [21:42:21] I will continue to deploy, but maybe I will restart jobs tomorrow morning [21:42:33] no probs :) I'll be around [21:42:38] ok [21:42:41] :] [21:42:48] have a good night [21:46:34] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging to deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667942 (owner: 10Maven-release-user) [21:46:53] byeee [21:48:17] !log deployed refinery-source v0.1.2 [21:48:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:59:15] !log starting refinery deployment using scap [21:59:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [22:14:20] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Sanitize and ingest all event tables into the event_sanitized database - https://phabricator.wikimedia.org/T273789 (10Ottomata) [22:22:59] 10Analytics, 10Event-Platform, 10Product-Analytics, 10Product-Data-Infrastructure, 10MW-1.36-notes (1.36.0-wmf.32; 2021-02-23): [MEP] [BUG] Timestamp format changed in migrated server-side EventLogging schemas - https://phabricator.wikimedia.org/T276235 (10Mholloway) Verified that timestamps are now comi... [22:32:22] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging for deployment train." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/667932 (https://phabricator.wikimedia.org/T269032) (owner: 10Joal) [23:15:50] !log finished deployment of refinery to hdfs [23:15:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
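The two `!log` entries above (scap deployment, then syncing refinery to HDFS) correspond to the usual two-stage refinery deploy. A hedged sketch of its shape; the host names, deploy path, user and wrapper-script options are assumptions based on common WMF practice at the time rather than a verbatim runbook:

```
# Stage 1: push the refinery repo to its targets with scap, from the deployment host.
ssh deploy1001.eqiad.wmnet
cd /srv/deployment/analytics/refinery
scap deploy 'analytics weekly train'

# Stage 2: from a Hadoop client host, sync the deployed refinery into HDFS so
# Oozie jobs can pick up the new artifacts (script shipped in refinery/bin).
ssh an-launcher1002.eqiad.wmnet
sudo -u analytics kerberos-run-command analytics \
    /srv/deployment/analytics/refinery/bin/refinery-deploy-to-hdfs --verbose --no-dry-run
```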