[01:44:21] <wikibugs>	 (PS1) GoranSMilovanovic: T239201 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/670630
[01:44:47] <wikibugs>	 (CR) GoranSMilovanovic: [V: +2 C: +2] T239201 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/670630 (owner: GoranSMilovanovic)
[04:57:58] <wikibugs>	 Analytics, Product-Infrastructure-Team-Backlog, Wikimedia Taiwan, Chinese-Sites, Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (Htchien) Hi @MSantos, I have given your email to the CHT MOD team, they should contact y...
[06:46:54] <elukey>	 good morning
[07:30:34] <wikibugs>	 Analytics, Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (elukey)
[07:30:46] <elukey>	 let's do it --^ 
[07:35:14] <wikibugs>	 Analytics: Check home/HDFS leftovers of dedcode - https://phabricator.wikimedia.org/T276748 (MGerlach) @elukey  that sounds good. thanks
[07:41:51] <wikibugs>	 Analytics: Check home/HDFS leftovers of dedcode - https://phabricator.wikimedia.org/T276748 (elukey) @MGerlach I created on the stat boxes `/home/mgerlach/dedcode_home`,  and changed file ownership permission to your username, lemme know if you can read files etc..  I am going to proceed to drop hdfs and hiv...
[07:55:58] <wikibugs>	 Analytics, Product-Infrastructure-Team-Backlog, Wikimedia Taiwan, Chinese-Sites, Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (JAllemandou) Hi @Htchien and @MSantos, I have been contacted by a person working at CHT...
[08:15:56] <elukey>	 !log hdfs dfs -rmr /user/dedcode on an-launcher1002 (data in trash for a month) - T276748
[08:15:59] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:15:59] <stashbot>	 T276748: Check home/HDFS leftovers of dedcode - https://phabricator.wikimedia.org/T276748
[08:25:46] <elukey>	 !log drop database dedcode cascade in hive - T276748
[08:25:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:25:49] <stashbot>	 T276748: Check home/HDFS leftovers of dedcode - https://phabricator.wikimedia.org/T276748
[08:25:59] <elukey>	 it seems that now cascade cleans up all the dirs on hdfs as well
[08:26:31] <wikibugs>	 Analytics: Check home/HDFS leftovers of dedcode - https://phabricator.wikimedia.org/T276748 (elukey) Open→Resolved a:elukey All cleaned up! Please re-open if needed :)
[08:26:39] <wikibugs>	 Analytics: Check home/HDFS leftovers of dedcode - https://phabricator.wikimedia.org/T276748 (MGerlach) >>! In T276748#6903530, @elukey wrote: > @MGerlach I created on the stat boxes `/home/mgerlach/dedcode_home`,  and changed file ownership permission to your username, lemme know if you can read files etc.....
[08:44:03] <klausman>	 My wrist/forearm is acting up again. Taking it easy today.
[08:46:58] <elukey>	 ack klausman :)
[09:23:45] <elukey>	 I am reading https://blog.cloudera.com/yarn-capacity-scheduler/
[09:24:07] <elukey>	 there is a very nice thing called "routing", so users can be targeted to queues automatically
[09:24:37] <elukey>	 for example, we could say that 'analytics' needs to run in the 'production' queue by default, so if we forget when launching an oozie job it will not be an issue
[09:25:22] <elukey>	 there is also a tool called fs2cs to come up with a starting config from a fair scheduler one, but available only for hadoop 3
[09:32:19] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:34:47] <wikibugs>	 Analytics: Check home/HDFS data of Bernd Sitzmann - https://phabricator.wikimedia.org/T273712 (dr0ptp4kt) Yes, please.
[09:35:22] <wikibugs>	 Analytics-Radar, Cassandra, ContentTranslation, Event-Platform, and 10 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (akosiaris)
[09:37:53] <joal>	 elukey: maybe we could use fs2cs in a fake hadoop3 cluster, and try to backport config?
[09:39:16] <elukey>	 joal: bonjour! In theory we can get the yarn package somewhere and install it, even in a vm, it is a cli tool
[09:39:39] <elukey>	 but I am worried that it will use yarn3 specific things
[09:40:00] <elukey>	 in theory it shouldn't be super difficult to come up with a config now, as you were saying we have few queues
[09:40:57] <elukey>	 also there is elasticity between min/max requirements, I think that we can come up with a config in a short time (hopefully)
[09:41:00] <elukey>	 what do you think?
[09:41:32] <elukey>	 something like: production min 20% max 60%, essential etc.. (may be a leaf of production)
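A minimal sketch of what the min/max and routing ideas discussed above could look like as Capacity Scheduler properties. Only the `production` queue with min 20% / max 60% and the `analytics` user routed to `production` come from the chat; the `default` queue, its percentages, and the helper name `capacity_scheduler_properties` are hypothetical, chosen only so that sibling capacities add up to 100.

```python
def capacity_scheduler_properties(queues, user_mappings):
    """Build capacity-scheduler.xml style properties for leaf queues under root.

    queues: {queue_name: {"capacity": int, "maximum": int}} (percentages)
    user_mappings: {user: queue_name}, the "routing" of users to queues
    """
    total = sum(q["capacity"] for q in queues.values())
    if total != 100:
        # Sibling queue capacities are expected to add up to 100%.
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, q in queues.items():
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(q["capacity"])         # guaranteed share
        props[f"{prefix}.maximum-capacity"] = str(q["maximum"])  # elastic ceiling

    for user, queue in user_mappings.items():
        if queue not in queues:
            raise ValueError(f"user {user!r} mapped to unknown queue {queue!r}")
    # u:<user>:<queue> sends a user's applications to a queue by default.
    props["yarn.scheduler.capacity.queue-mappings"] = ",".join(
        f"u:{user}:{queue}" for user, queue in user_mappings.items()
    )
    return props


# 'default' and its numbers are illustrative assumptions, not from the chat.
queues = {
    "production": {"capacity": 20, "maximum": 60},
    "default": {"capacity": 80, "maximum": 100},
}
props = capacity_scheduler_properties(queues, {"analytics": "production"})
for key, value in props.items():
    print(f"{key} = {value}")
```

With a mapping like this in place, a job launched by `analytics` without an explicit queue would land in `production`, which is the "forgot to set the queue in oozie" case mentioned above.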
[09:56:46] <elukey>	 ah so the fs2cs spins up a FairScheduler instance, so it will create a Yarn3 one
[09:59:02] <elukey>	 there is a very nice example in https://www.youtube.com/watch?v=kYBKQmBrAgg
[10:02:22] <wikibugs>	 Analytics, SRE, observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (fgiunchedi) >>! In T276972, @Ottomata wrote: > Our multi DC kafka setup works like this: > - Producers prefix topics with their datacenter name, e.g. eqiad.mediawiki....
[10:06:19] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:24:35] <wikibugs>	 Analytics: Review the Yarn Capacity scheduler and see if we can move to it - https://phabricator.wikimedia.org/T277062 (elukey) Video from the ApacheCon about the fs2cs tool (https://www.youtube.com/watch?v=kYBKQmBrAgg), that it is available from Yarn 3. The tool spins up a FairScheduler instance to work, so...
[10:24:47] <elukey>	 joal: added some thoughts in --^
[10:34:39] <wikibugs>	 Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (awight) >>! In T210106#6900374, @phuedx wrote: > You could hav...
[10:44:03] <elukey>	 I am going to use the docker trunk image for bigtop to build hadoop 3.x pkgs, then I'll install it in a container and will try to run the fs2cs tool
[10:49:36] <wikibugs>	 Analytics-EventLogging, Analytics-Radar, Front-end-Standards-Group, MediaWiki-extensions-WikimediaEvents, and 4 others: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 (awight) >>! In T210106#6900374, @phuedx wrote: > However there...
[10:51:17] <wikibugs>	 Analytics: Check home/HDFS data of Bernd Sitzmann - https://phabricator.wikimedia.org/T273712 (elukey) Open→Resolved a:elukey All dropped, thanks!
[10:52:21] <elukey>	 !log drop /home/bsitzmann on all stat100x hosts - T273712
[10:52:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:52:24] <stashbot>	 T273712: Check home/HDFS data of Bernd Sitzmann - https://phabricator.wikimedia.org/T273712
[13:21:51] <mforns>	 heya teammm
[13:22:29] <elukey>	 hola hola
[13:24:12] <wikibugs>	 Analytics-Radar, Cassandra, ContentTranslation, Event-Platform, and 10 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (MSantos)
[14:13:03] <elukey>	 mmmm fs2cs is not even in 3.2.2
[14:13:04] <elukey>	 sigh
[14:13:26] <joal>	 elukey: I'm reviewing the task and reading detailed docs of scheduler
[14:13:34] <joal>	 elukey: I will send an update soon
[14:14:04] <elukey>	 joal: does it make sense what I wrote?
[14:14:15] <joal>	 elukey: my point is: given the simplicity of our setup, I think it's easier for us to explicitly decide on a config and make it
[14:14:21] <joal>	 elukey: it does indeed!
[14:14:35] <joal>	 elukey: My comment starts with: thanks for the very nice prep work :)
[14:14:51] <elukey>	 <3
[14:17:46] <mforns>	 fdans: do you think we can put the traffic anomalies meetings on hold until sukhbir comes back?
[14:51:05] <wikibugs>	 Analytics: Review the Yarn Capacity scheduler and see if we can move to it - https://phabricator.wikimedia.org/T277062 (JAllemandou) Thanks for the nice prep work @elukey :)  > We could use FIFO for the sequential queue, and FAIR for the rest +1 > I would also allow the 100% value only for the production and...
[14:53:21] <wikibugs>	 Analytics: Review the Yarn Capacity scheduler and see if we can move to it - https://phabricator.wikimedia.org/T277062 (elukey) @JAllemandou very nice and clean, I like it, going to prep a puppet code change to start testing it :)
[14:58:18] <joal>	 elukey: I forgot some stuff in my comment about the scheduler - Please give me until tonight before you start your patch :)
[14:59:40] <elukey>	 joal: I am kicking it off anyway just to have a baseline, it is easy to change it later on
[14:59:46] <joal>	 ack :)
[15:01:41] <wikibugs>	 (CR) Mforns: "Code looks good overall! Cleaner than before and generic." (9 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/670321 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[15:17:05] <wikibugs>	 (CR) Jhernandez: [C: -2] "Very useful info. The request to https://noc.wikimedia.org/conf/dblists/all.dblist is quite small (11kb) and it would be fairly easy to ge" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/668544 (owner: Jhernandez)
[15:28:49] <wikibugs>	 (CR) Jhernandez: multiinstance: Attempt to make quarry work with multiinstance replicas (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: Bstorm)
[15:33:14] <razzi>	 Hi all
[15:33:19] <elukey>	 good morning
[15:34:02] <razzi>	 !log rebalance kafka partitions for webrequest_upload partition 17
[15:34:02] <razzi>	 18 / 48 partitions!
[15:34:03] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:34:32] <elukey>	 nice :)
[15:35:51] <wikibugs>	 (CR) Bstorm: "> Patch Set 1:" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/668544 (owner: Jhernandez)
[15:36:05] <wikibugs>	 Analytics, DC-Ops, SRE, ops-eqiad: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (Cmjohnson) @wiki_willy This server is out of warranty by 1 year (purchased 2017) I can probably find a used one in our decom servers. Let me know if this is how you want t...
[15:36:11] <wikibugs>	 Analytics, DC-Ops, SRE, ops-eqiad: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (Cmjohnson) a:wiki_willy
[15:42:14] <wikibugs>	 Analytics, Event-Platform, Continuous-Integration-Config: Jenkins-bot does not submit changes on passing gate-and-submit for /schemas/event/* repos - https://phabricator.wikimedia.org/T277051 (hashar) You are welcome, thank you for confirming the fix!
[15:44:21] <wikibugs>	 Analytics, Machine-Learning-Team: Configure the Hadoop cluster to use the GPUs available on some workers - https://phabricator.wikimedia.org/T276791 (Miriam) Just a follow-up on a few use cases from the Research team.   In most cases, when we train machine learning models, the typical pipeline is the fol...
[15:48:07] <wikibugs>	 Analytics, SRE, ops-eqiad: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (Cmjohnson) a:elukey I replaced the disk it's in an unconfigured state: Can you add it back to the raid?  Firmware state: Unconfigured(good), Spun Up
[15:51:44] <elukey>	 razzi: --^
[15:52:00] <elukey>	 do you want to try to re-add the disk?
[15:52:13] <elukey>	 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk
[15:52:29] <razzi>	 elukey: yeah, let me give it a go
[15:57:59] <razzi>	 Following the steps for swapping broken disk, analytics1059 is a hadoop worker node, so disks should look like https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Standard_Worker_Installation_(12_disk,_2_flex_bay_drives_-_analytics1058-analytics1077,_most_of_an-worker10XX_except_the_ones_equipped_with_a_GPU). Logging in to check status via megacli
[16:00:06] <elukey>	 razzi: just added a note at the end, refresh the page when you have a moment
[16:00:41] <elukey>	 (going out for a quick errand, I'll read in a bit)
[16:00:46] * elukey afk! bbiab!
[16:21:29] <razzi>	 elukey: whenever you're back, the disk for analytics1059 is: `Firmware state: Unconfigured(good), Spun Up`, which is neither Configured(good) nor Unconfigured(bad)
[16:22:01] <razzi>	 Seems to be closer to Unconfigured(bad) but I'll wait to proceed
[16:25:45] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata)
[16:26:56] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata)
[16:29:08] <elukey>	 razzi: you can proceed, it is probably a typo in the doc
[16:29:17] <elukey>	 consider it as good
[16:29:34] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata) Solutions?  A. Restructure wgEventStreams to be keyed by stream name.  I think doing this would not be so difficult, but we'd lose...
[16:31:02] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata)
[16:34:03] <razzi>	 elukey: cool. Looks like I'm on to editing commands, would you like to meet and watch over my shoulder?
[16:34:30] <elukey>	 razzi: it is fine, you can write in here if you want
[16:34:45] <razzi>	 alright, going to `megacli -CfgForeign -Clear -a0`
[16:35:24] <razzi>	 since `megacli -CfgForeign -Scan -a0` gave `There are 1 foreign configuration(s) on controller 0.`
[16:35:53] <elukey>	 ok
[16:46:01] <razzi>	 No preserved cache to clear; proceeding with `sudo megacli -CfgLdAdd -r0 [32:3] -a0`
[16:46:07] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata)
[16:50:38] <elukey>	 razzi: new disk appeared in dmesg, good :)
[16:51:31] <wikibugs>	 Analytics-Radar, SRE, ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (Cmjohnson)
[16:55:26] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (mforns) Could we key the config by stream name, and have an extra key "regexp_streams" (better name to be found) that contains an integer-in...
[16:57:21] <razzi>	 ok cool, looks like disk in question is sde, but I don't see it in fdisk, should it be there elukey?
[16:58:23] <elukey>	 razzi: it is there, fdisk -l shows it at the bottom
[16:59:04] <razzi>	 ok whew good :)
[17:00:11] <razzi>	 now something strange, parted is showing "command not found"
[17:00:23] <elukey>	 razzi: yes ok to install it
[17:12:54] <wikibugs>	 Analytics, Better Use Of Data, Event-Platform: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata) > No idea if that fulfills the requirements for stream config discovery I dunno, maybe we should just get rid of the regex stream...
[17:38:49] <wikibugs>	 Analytics, DC-Ops, SRE, ops-eqiad: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (wiki_willy) a:wiki_willy→Cmjohnson Hi @cmjohnson - it sounds like they need it in production.  @elukey  or @Ottomata - let us know if there's a particular decom'd...
[17:43:03] <wikibugs>	 Analytics, DC-Ops, SRE, ops-eqiad: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (elukey) @wiki_willy any decommed host is fine!
[17:45:09] <wikibugs>	 Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 2 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (Edtadros) === Test Result - Beta  **Status:** ✅ PASS **Environment:**...
[17:48:34] <wikibugs>	 Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 2 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (Edtadros)
[17:49:44] <wikibugs>	 Analytics: Inconsistent systemd default task max on hadoop workers - https://phabricator.wikimedia.org/T274860 (elukey) Open→Resolved Closing this for the moment, since we have the same kernel everywhere now and the defaults seem ok.
[17:50:25] <wikibugs>	 Analytics, SRE, ops-eqiad: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (elukey) a:elukey→razzi
[17:57:23] <wikibugs>	 Analytics, CirrusSearch, SRE, Wikidata, and 4 others: Upgrade prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (Ottomata) Hello!  Does Analytics have to upgrade too? :)
[17:57:38] <wikibugs>	 Analytics-Clusters, CirrusSearch, SRE, Wikidata, and 4 others: Upgrade prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (Ottomata)
[17:58:35] <wikibugs>	 Analytics-Radar, SRE, observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (Ottomata)
[17:58:52] <wikibugs>	 Analytics-Clusters, DC-Ops, SRE, ops-eqiad: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (Ottomata)
[18:00:26] <wikibugs>	 Analytics, PM: Fix Analytics workflow for #Analytics-EventLogging tasks - https://phabricator.wikimedia.org/T274490 (Ottomata) @Aklapper whatever you think is best here is fine with us.  Adding a Herald rule to auto-tag Analytics is fine.
[18:00:57] <wikibugs>	 Analytics-Clusters, SRE, ops-eqiad: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (Ottomata)
[18:08:02] <wikibugs>	 Analytics-Clusters: Configure the HDFS Namenodes to use the log4j rolling gzip appender - https://phabricator.wikimedia.org/T276906 (Ottomata)
[18:11:32] <wikibugs>	 Analytics-Clusters, Data-Persistence-Backup: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (Ottomata)
[18:11:45] <wikibugs>	 Analytics-Clusters: Review the Yarn Capacity scheduler and see if we can move to it - https://phabricator.wikimedia.org/T277062 (Ottomata)
[18:12:22] <wikibugs>	 Analytics-Clusters, Technical-blog-posts: Story idea for Blog: Migration of the Analytics Hadoop infrastructure to Apache Bigtop - https://phabricator.wikimedia.org/T277133 (Ottomata)
[18:24:16] <wikibugs>	 Analytics, Analytics-EventLogging, Better Use Of Data, Event-Platform, and 4 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (Ottomata)
[18:34:07] <elukey>	 razzi: let's finish the analytics1059 disk, do you have a moment?
[18:34:26] <razzi>	 elukey: yep!
[18:34:38] <elukey>	 fine to install parted, we may want to add it into puppet too
[18:35:11] <razzi>	 !log apt-get install parted on analytics1059
[18:35:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:36:52] <razzi>	 I'll follow step 7 in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk:
[18:36:52] <razzi>	 ```
[18:36:52] <razzi>	 sudo parted /dev/sde --script mklabel gpt
[18:36:52] <razzi>	 sudo parted /dev/sde --script mkpart primary ext4 0% 100%
[18:36:52] <razzi>	 sudo mkfs.ext4 -L hadoop-d /dev/sde1
[18:36:53] <razzi>	 sudo tune2fs -m 0 /dev/sde1
[18:36:53] <razzi>	 ```
[18:37:39] <elukey>	 razzi: hadoop-d no bueno, hadoop-e :)
[18:37:48] <razzi>	 whoops! yep
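The label mix-up above (`hadoop-d` typed for `/dev/sde`) is the kind of slip that deriving the label from the device name would avoid. A rough sketch, assuming the label always follows the device letter (sde gives hadoop-e, as in this exchange); the helper `disk_prep_commands` is hypothetical and only prints the same four commands quoted from the wiki step, it does not run them.

```python
import re


def disk_prep_commands(device, label_prefix="hadoop"):
    """Return the partition/format commands for a whole-disk device like /dev/sde.

    Assumption: the filesystem label is '<label_prefix>-<device letter>',
    which matches the sde -> hadoop-e correction in the chat.
    """
    match = re.fullmatch(r"/dev/sd([a-z])", device)
    if match is None:
        raise ValueError(f"expected a device like /dev/sde, got {device!r}")
    label = f"{label_prefix}-{match.group(1)}"
    partition = f"{device}1"  # first (and only) partition created below
    return [
        f"sudo parted {device} --script mklabel gpt",
        f"sudo parted {device} --script mkpart primary ext4 0% 100%",
        f"sudo mkfs.ext4 -L {label} {partition}",
        f"sudo tune2fs -m 0 {partition}",
    ]


# Print the commands for review before pasting them into a shell.
for command in disk_prep_commands("/dev/sde"):
    print(command)
```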
[18:40:01] <razzi>	 ok, all done, and I see the commented out file system in /etc/fstab
[18:42:32] <razzi>	 elukey: looks like I should update the fstab entry to what I see in `lsblk -i -fs` for sde1
[18:42:48] <razzi>	 the uuid, that is
[18:42:58] <elukey>	 yep!
[18:43:11] <elukey>	 so the labels are a more convenient way
[18:43:21] <elukey>	 but we don't have it deployed everywhere
[18:43:28] <elukey>	 the init worker cookbook uses them
[18:43:41] <elukey>	 but for example when I reimaged all the nodes UUID came back
[18:43:49] <elukey>	 by default after debian install
[18:43:56] <elukey>	 the partitions do have labels anyway
[18:44:16] <elukey>	 in this case, 1059 is an old node without them
[18:44:30] <razzi>	 I did see the hadoop-e label which was comforting
[18:44:49] <elukey>	 ok so now we need to mount it
[18:45:19] <elukey>	 this will cause puppet to re-add it to the hadoop hdfs/yarn confs, so you'll also have to restart them after running puppet
[18:45:29] <elukey>	 so they'll pick up the new disk
[18:46:04] <razzi>	 `mount -a`?
[18:46:11] <elukey>	 yep it is fine
[18:46:19] <razzi>	 done
[18:46:41] <elukey>	 ack, so df -h shows it correctly, as expected empty
[18:46:44] <elukey>	 now puppet run
[18:47:00] <elukey>	 and systemctl restart hadoop-yarn-nodemanager and hadoop-hdfs-datanode
[18:47:10] <elukey>	 log everything and then close the task :)
[18:47:22] <elukey>	 (let's not keep it open since it is in the dcops queue as well)
[18:47:22] <razzi>	 cool
[18:47:30] <elukey>	 any doubt/question?
[18:47:52] <razzi>	 what is the effect of running puppet here?
[18:48:42] <razzi>	 ok I'm able to see the result, /var/lib/hadoop/data/e is configured to work with hadoop
[18:48:52] <elukey>	 so in profile::hadoop::common we have
[18:48:52] <elukey>	     # The datanode mountpoints are retrieved from facter, among the list of mounted
[18:48:56] <elukey>	     # partitions on the host. Once a partition is not available anymore (disk broken for example),
[18:48:59] <elukey>	     # it is sufficient to run puppet to update the configs (and restart daemons if needed).
[18:49:02] <elukey>	     $all_partitions = $facts['partitions'].map |$device, $partition_metadata| { $partition_metadata['mount'] }
[18:49:05] <elukey>	     $datanode_mounts = $all_partitions.filter |$partitions| { $datanode_mounts_prefix in $partitions }
[18:49:14] <elukey>	 this is why you see the new path configured
[18:49:20] <elukey>	 razzi: --^
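For readers less used to Puppet, the two lines quoted from `profile::hadoop::common` amount to a map followed by a filter. A small Python rendering of the same logic, where the sample `facts_partitions` data and the prefix value `/var/lib/hadoop/data` are assumptions for illustration (the real value of `$datanode_mounts_prefix` is not shown in the chat):

```python
def datanode_mounts(partitions, prefix):
    """Mirror of the Puppet map + filter: keep mount points containing prefix.

    partitions: {device: {"mount": path, ...}} as reported by facter;
    unmounted partitions have no "mount" key and are skipped.
    """
    all_mounts = [metadata.get("mount") for metadata in partitions.values()]
    return [mount for mount in all_mounts if mount is not None and prefix in mount]


# Hypothetical facter output for a worker after the new disk is mounted.
facts_partitions = {
    "/dev/sda1": {"mount": "/"},
    "/dev/sdd1": {"mount": "/var/lib/hadoop/data/d"},
    "/dev/sde1": {"mount": "/var/lib/hadoop/data/e"},
    "/dev/sdf1": {},  # present but not mounted, e.g. a broken or new disk
}

print(datanode_mounts(facts_partitions, "/var/lib/hadoop/data"))
```

This is why `mount -a` followed by a puppet run is enough: once the partition shows up as mounted in the facts, it passes the filter and gets written into the HDFS/YARN configs, and the daemons pick it up on restart.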
[18:50:03] <razzi>	 ok cool. That makes sense
[18:50:20] <elukey>	 if you want to add these bits to the docs please :)
[18:50:32] <elukey>	 any time you see something not there, take some time to update
[18:50:45] <razzi>	 !log systemctl restart hadoop-yarn-nodemanager on analytics1059
[18:50:47] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:50:49] <elukey>	 it is annoying I know but your future self will thank you
[18:51:27] <razzi>	 yep :)
[18:51:35] <elukey>	 razzi: I am stepping away a bit but if you want to do matomo please go ahead
[18:51:57] <razzi>	 yeah, I'll restart matomo, seems low risk. Thanks elukey! Have a good evening
[18:52:09] <elukey>	 have a good rest of the day :)
[18:52:52] <razzi>	 !log systemctl restart hadoop-hdfs-datanode on analytics1059
[18:52:53] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:53:26] <wikibugs>	 Analytics-Clusters, CirrusSearch, SRE, Wikidata, and 4 others: Upgrade prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (colewhite) >>! In T276595#6905896, @Ottomata wrote: > Hello!  Does Analytics have to upgrade too? :)  The updated jar will be deployed to our apt repo whi...
[18:58:49] <wikibugs>	 Analytics-Clusters, SRE, ops-eqiad: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (razzi) Open→Resolved Disk is added to raid. Thanks @Cmjohnson for doing the replacement.
[19:29:07] <wikibugs>	 Analytics-Clusters: Review the Yarn Capacity scheduler and see if we can move to it - https://phabricator.wikimedia.org/T277062 (JAllemandou) I forgot some settings I think would be interesting for us when reading the capacity-scheduler docs:  **To set** * `yarn.scheduler.capacity.<queue-path>.minimum-user-l...
[19:54:15] <wikibugs>	 Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 3 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (Mholloway) From my preliminary spot checking, the patch appears to ha...
[20:03:48] <wikibugs>	 (CR) Ottomata: [WIP] Refactor EventLoggingSanitization to a generic job: RefineSanitize (7 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/670321 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[20:03:51] <wikibugs>	 (PS3) Ottomata: [WIP] Refactor EventLoggingSanitization to a generic job: RefineSanitize [analytics/refinery/source] - https://gerrit.wikimedia.org/r/670321 (https://phabricator.wikimedia.org/T273789)
[20:07:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP] Refactor EventLoggingSanitization to a generic job: RefineSanitize [analytics/refinery/source] - https://gerrit.wikimedia.org/r/670321 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[20:08:06] <razzi>	 !log starting reboot of matomo1002 for kernel upgrade
[20:08:08] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:10:08] <wikibugs>	 Analytics, Product-Analytics, Structured Data Engineering, MW-1.36-notes (1.36.0-wmf.22; 2020-12-15), and 2 others: [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (egardner) Open→Resolved All of our schema updates and instrumentation patches have now b...
[20:20:28] <razzi>	 !log disable maintenance mode for matomo1002
[20:20:33] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:30:41] <wikibugs>	 Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 3 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (brennen) @Mholloway: Thoughts on T277229?
[22:02:53] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (razzi) @Ottomata and I enabled forwarding traffic from analytics hosts, so teams like Product Analytics with access to the stat boxes will be able to run `ssh -NL 8080:an-tool1...
[22:11:43] <wikibugs>	 Analytics-Radar, Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (Jdlrobson) Open→Resolved a:Jdlrobson This is less of a problem my side. If it becomes a problem again I recommend allowing a maximum of 50 errors from a single client. Unt...
[22:31:41] <razzi>	 !log rebalance kafka partitions for webrequest_upload partition 17
[22:31:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:32:32] <wikibugs>	 (CR) Jdlrobson: universalLanguageSelector: Add new properties (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668743 (https://phabricator.wikimedia.org/T275766) (owner: Phuedx)
[22:40:04] <razzi>	 I'm signing off for the day and am taking a vacation day tomorrow, see y'all next week!
[23:55:43] <wikibugs>	 (PS5) Lex Nasser: Create pageviews 'top-per-country' endpoint with tests [analytics/aqs] - https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171)