[06:05:22] <elukey>	 good morning!
[06:05:38] <elukey>	 I am going to disable the timers as prep step for the TLS maintenance
[06:06:22] <elukey>	 !log stop timers on an-launcher1002 as prep step before maintenance
[06:06:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:55:35] <elukey>	 this morning I had the idea of creating https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule
[06:55:59] <elukey>	 to have a summary of maintenance schedules if needed
[07:10:28] <klausman>	 Morning!
[07:10:37] <klausman>	 About to start the backup of stat1006
[07:11:00] <elukey>	 ack
[07:12:00] <joal>	 Hi elukey - I forgot yesterday that today is kids day - I'll follow from distance
[07:12:06] <joal>	 Hi klausman 
[07:14:34] <joal>	 elukey: the only problematic job I see for us is the wikitext-history one - You can stop/restart, I'll manage manually covering for it once done
[07:24:34] <elukey>	 joal: bonjour! I can postpone the maintenance if you wish to tomorrow
[07:25:05] <elukey>	 there is also a big mjolnir job running
[07:29:27] <elukey>	 all right so I'll postpone to tomorrow
[07:29:40] <elukey>	 !log re-enable timers on al-launcher1002 - maintenance postponed
[07:29:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:31:12] <elukey>	 klausman: o/
[07:32:16] <elukey>	 one thing to keep in mind for the next reimages - stat1004 was not rebooted (my bad) after the reimage, it is a step that normally it is done by wmf-auto-reimage. There is a icinga alert about some kernel options not enabled for stat1004, we'll have to reboot it
[07:32:29] <elukey>	 I am going to schedule the maintenance for friday
[07:32:33] <elukey>	 (sending an email now)
[07:36:36] <elukey>	 also updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule
[07:38:00] <klausman>	 Thanks!
[07:38:09] <klausman>	 It looks like this backup is even slower than the last :(
[07:38:32] <klausman>	 for the first 20m or so, I'm calculating about 65MiB/s
[07:40:03] <elukey>	 :(
[07:40:52] <elukey>	 I think we should start re-thinking about using bacula to backup those home dirs
[07:41:23] <elukey>	 there is an agreement with our users that those dirs are not backed up, due to their size etc.., but I fear that a lot of people work as they are
[07:41:37] <elukey>	 (so rely on us to avoid loosing data)
[07:43:03] <klausman>	 The problem is that a bunch of other tasks are also hitting the disk pretty hard
[07:43:14] <elukey>	 yes good point
[07:43:33] <klausman>	 Including one my our very own mforns :)
[07:43:41] <elukey>	 even if incremental backups, in theory should not be so heavy
[07:44:58] <klausman>	 Oh, and one job of yours! :-P
[07:45:28] <elukey>	 on what node?
[07:45:28] <klausman>	 elukey   17473  0.4  0.8 15248908 545316 ?     Sl   Mar30 1111:31 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/lib/spark2/conf/:/usr/lib/spark2/jars/*:/etc/hadoop/conf/ -Xmx6g org.apache.spark.deploy.SparkSubmit --master local[4] --conf spark.driver.memory=6g pyspark-shell
[07:45:32] <klausman>	 1006
[07:45:49] <elukey>	 well that is an idle notebook, I am pretty sure it is not hitting the disk no?
[07:45:57] <klausman>	 20M/s
[07:46:03] <klausman>	 (according to htop)
[07:46:43] <elukey>	 stopped it then, I am a bit confused
[07:47:19] <klausman>	 just start htop as root on that machine, it's currently configured to sort by disk i/o
[07:47:54] <elukey>	 nono I trust you, I am confused about why spark needed that
[07:47:59] <klausman>	 While the backup (tar) is the biggest I/O user, a bunch of other stuff is using the disk as well, in sum more than the tar
[07:48:15] <klausman>	 Most of it is spark, but also python
[07:49:44] <wikibugs>	 10Analytics-Radar, 10Operations, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10elukey) 05Stalled→03Open Also opened https://github.com/cloudera/hue/pull/1271
[08:04:09] <joal>	 elukey: there are chances the spark wikitext job will not be finished tomorrow, so nevermind for that
[08:04:24] <joal>	 not killing mjolnir is nice though :)
[08:06:15] <elukey>	 sigh
[08:08:13] * joal sends ops-love to elukey :
[08:11:58] <joal>	 elukey: there is somthing bizzare on stat1006
[08:15:12] <klausman>	 How so?
[08:22:00] <klausman>	 I've doen some renicing/ionicing to help, but it's helping little, at best
[08:23:40] <joal>	 mforns: would you be nearby by any chance?
[08:27:52] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Cparle) Sounds good to me @egardner
[08:28:11] <klausman>	 At this rate, the backup will take close to 17 hours, or until midnight UTC
[08:37:57] <klausman>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=stat1006&var-datasource=thanos&var-cluster=analytics&viewPanel=31 This also tells me that the network is awfully congested
[08:45:49] <klausman>	 Ok *something* changed
[08:46:03] <klausman>	 tar now gets more than twice the amount of I/O bandwidth from disk
[08:47:06] <joal>	 klausman: A bunch of spark jobs got killed
[08:48:32] <klausman>	 yeah, and restarted, we're back down to ~50MiB/s
[08:49:10] <joal>	 yup
[08:49:12] <klausman>	 Ok, maybe 65MiB/s
[08:50:03] <klausman>	 Also, 591 days of uptime *shudder*
[08:50:07] <klausman>	 491*
[08:53:45] <joal>	 klausman: I don't find an answer to the bandwidth usage :(
[08:54:34] <klausman>	 I suspect the high retransmit rate (TCP) is indicative of network congestion as well.
[08:55:43] <joal>	 right - there are spark drivers (communicationg with workers), and notebooks (communicating with clients) - I can't imagine how this could take 40Mb/s regular :S
[08:59:32] <elukey>	 klausman: I can assure you that we reboot when needed, probably that host didn't have important kernel upgrades to justify a maintenance :)
[09:00:12] <klausman>	 I mean, they're also not easily reachable from outside, AFAICT
[09:00:31] <klausman>	 Then again, we run workloads from a bunch of people
[09:01:27] <elukey>	 there are also strict firewall rules to prevent random things downloaded via pip to open ports and start services etc..
[09:02:14] <elukey>	 traffic towards production is filtered at the routers level
[09:02:28] <elukey>	 and there is not direct stat100x exposed to the internet
[09:03:06] <elukey>	 the thing that I don't like a lot is that we don't have a clear view of what is the status of each venv/similar, security wise
[09:03:26] <elukey>	 I am pretty sure that in a lot of envs we'd need to pip upgrade packages
[09:05:26] <klausman>	 Do we have any mechanism for draining machines? I.e. let all running jobs complete, but not allwoing new ones?
[09:05:34] <klausman>	 Because I feel that would have helped a lot here.
[09:06:07] <joal>	 klausman: we don't - yet
[09:10:08] <elukey>	 we have always assumed that any job running could have been restarted if needed in theory, but the usage of the home dirs exploded since the last round of debian installs
[09:12:13] <klausman>	 I am tempted to send a bunch of SIGSTOP, get the backups done, then SIGCONT, but if any of them have network connections, that'll break em
[09:12:17] <klausman>	 (also, kidding)
[09:14:54] <elukey>	 one thing that could stop new spark conns/drivers would be to modify ferm rules (temporary) to avoid opening ports for it
[09:15:21] <elukey>	 (the spark shell has a range of ports allowed to be opened to communicate with the driver in Yarn)
[09:15:24] <klausman>	 Would the other end then just error out?
[09:16:35] <elukey>	 I think that the spark shell would fail when bootstrapping since it will fail to communicate with the Yarn app master / driver, but it is a failure that is a bit cryptic if you don't know about the ports etc..
[09:17:11] <klausman>	 Would it retry?
[09:17:33] <klausman>	 Also, how does discovery work?
[09:17:57] <klausman>	 I feel it would be nicer to just have the driver not even know the host exists/accepts work
[09:18:36] <klausman>	 426GiB in 2h, and a total of 3.6T :-S
[09:18:48] <elukey>	 the spark client basically starts from a known port (12000 IIRC) and adds +1 each time that it finds it already bound
[09:19:01] <elukey>	 we have 100 ports available in that range
[09:19:24] <elukey>	 it was the compromise that we found to add firewall rules to stat100x while keeping spark working
[09:19:52] <klausman>	 So the clients running on 1006 open a port, and then what? where does the work come from?
[09:20:20] <elukey>	 if you specify spark2-shell --master yarn, the spark driver will run as Yarn application master
[09:20:46] <elukey>	 and the client on stat100x will be basically a lightweight process 
[09:21:02] <elukey>	 the driver in yarn can send info to the spark client on stat100x
[09:21:19] <elukey>	 so the port needs to be open on the stat100x host
[09:21:55] <elukey>	 ah yes the function is called "Block Manager"
[09:22:52] <klausman>	 How does yarn/the driver know about the stat hosts? static configuration?
[09:23:25] <elukey>	 no it is the host that requested the Yarn application in hadoop
[09:23:40] <elukey>	 so to be precise, we have two ranges
[09:23:45] <elukey>	 &R_SERVICE(tcp, 12000:12100, $ANALYTICS_NETWORKS);
[09:23:53] <elukey>	 &R_SERVICE(tcp, 13000:13100, $ANALYTICS_NETWORKS);
[09:24:13] <elukey>	 so the latter is the block manager, and I think it is used only when spark runs in local mode 
[09:24:36] <elukey>	 the former range is for the spark.driver.port
[09:24:55] <elukey>	 "Port for the driver to listen on. This is used for communicating with the executors and the standalone Master."
[09:25:25] <klausman>	 I see. So if anything, we'd need to tell 100x to not run any more jobs
[09:25:41] <elukey>	 so spark2-shell, if executed with --master yarn, asks to the Yarn resource manager on an-master1001 to allocate an application master
[09:25:52] <elukey>	 that will be the "remote" driver, doing the heavy lifting
[09:26:25] <elukey>	 and then it will make sure that executors can/will be allocated on worker nodes via yarn etc..
[09:26:35] <elukey>	 the client and remote driver needs to communicate
[09:26:58] <elukey>	 now if we block those ports spark-shell --master (new sessions) will probably fail
[09:27:30] <klausman>	 I'd just prefer to not break anything already running
[09:27:55] <klausman>	 I mean we could sorta do that with allowing connections in ESTABLISHED state in netfilter, but it seems crude.
[09:28:21] <elukey>	 those should already be left intact if we modify changes to ferm
[09:28:54] <elukey>	 the main issue is if an existing remote driver wants to open another tcp conn to the spark client for some reason
[09:28:58] <elukey>	 in that case it'll break
[09:29:09] <klausman>	 Yeah, I was thinking of that.
[09:29:45] <klausman>	 Besides cron and timers, these jobs are usually started by hand, right?
[09:30:30] <elukey>	 or from jupyter notebooks
[09:30:42] <elukey>	 but it should be almost the same
[09:33:13] <klausman>	 I wonder if a drain could be accomplished by just turning off cron and timers, and restrict SSH login to our team
[09:35:45] <elukey>	 could be an option but the unavailability time would be more
[09:52:32] <klausman>	 Another option would be to announce that at time X, all still-running jobs will be canceled. So people can use the machine until the last minute, including gracefully stopping work if possible,
[09:53:11] <klausman>	 Because I suspect a short but complete downtime is preferable to a long degradation in performance and potentially inconsistent backups.
[09:57:29] <elukey>	 yes this is a good point
[09:58:00] <elukey>	 or we could ask to SRE a regular incremental backup via bacula of those home dirs, that would solve the problem in the long term
[09:58:24] <elukey>	 but I don't think that we have room for this use case this fiscal
[09:59:41] <klausman>	 And I *do* wonder what that job of mforns is doing (PID 2993)
[10:00:15] <klausman>	 Unrelatedly, lunch!
[10:12:14] <wikibugs>	 10Analytics-Clusters, 10Discovery, 10Discovery-Search (Current work), 10Patch-For-Review: Move mjolnir kafka daemon from ES to search-loader VMs - https://phabricator.wikimedia.org/T258245 (10elukey) 05Stalled→03Resolved a:03elukey Closing it then, thanks a lot!
[10:34:18] * elukey lunch!
[11:54:13] <fdans>	 hellooo
[12:02:32] <klausman>	 'lo
[12:12:40] <klausman>	 elukey: I think the table for the maint schedule would be more useful with the date/time (in iso8601) as the first column, then the rest of the current columns in order.
[12:12:59] <klausman>	 Or maybe in this format: 2020-09-23 09:00
[12:32:22] <elukey>	 klausman: sure, feel free to modify ut
[12:32:24] <elukey>	 *it
[12:41:17] <klausman>	 will do
[12:46:30] <klausman>	 and done
[12:50:09] <elukey>	 changed a tiny bit the format in the (), but looks good
[13:08:01] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10CBogen) @dcausse and @EBernhardson, just an FYI about @egardner's commen...
[13:12:48] <klausman>	 wfm
[13:16:39] <wikibugs>	 (03PS1) 10Ottomata: [WIP] Use EventStreamConfig in CamusPartitionChecker [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609)
[13:17:44] <ottomata>	 joal: yt?
[13:20:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Use EventStreamConfig in CamusPartitionChecker [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[13:23:46] <wikibugs>	 (03CR) 10Jforrester: "recheck; sorry, had to restart CI." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[13:39:24] <ottomata>	 ahh elukey  today the ops sync conflicts with the MEP sync
[13:39:25] <ottomata>	 hmmm
[13:48:48] <wikibugs>	 10Analytics-Radar, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (10dcausse) a:03dcausse
[13:50:36] <elukey>	 ottomata: we can postpone, no problem
[14:05:45] <elukey>	 a-team: one cp node in ulsfo is running varnish 6 and varnishkafka seems working fine
[14:06:33] <milimetric>	 sweet
[14:06:38] <milimetric>	 no-op upgrade, crazy!
[14:06:46] <mforns>	 hi joal, just joined
[14:07:11] <mforns>	 hi klausman, I saw I have a job running? I was not aware, on what machine?
[14:07:20] <klausman>	 stat1006
[14:07:23] <klausman>	 PID 2993
[14:07:30] <ottomata>	 i'm going to move ops sync to tomorrow same time
[14:08:12] <mforns>	 klausman: looking
[14:08:40] <wikibugs>	 (03PS1) 10Ottomata: Spark JsonSchemaConverter - additionalProperties with schema is always a MapType [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629406 (https://phabricator.wikimedia.org/T263466)
[14:09:35] <klausman>	 Started March 11th. well over 2000 CPU-hours accumulated
[14:12:17] <dsaez>	 hi all
[14:12:47] <mforns>	 klausman: man, no idea what that is, it must be some short test I did in jupyter notebooks, and I failed to close it properly
[14:12:58] <mforns>	 you want me to kill it?
[14:13:05] <dsaez>	 question, is this: https://wikitech.wikimedia.org/wiki/Analytics/Web_publication supposed to work on any stat machine? I've tried from stat1006-8 and just worked for stat1007
[14:14:27] <elukey>	 yep it should
[14:21:28] <milimetric>	 elukey: I had to login to hue-next again and it didn't do the double login thing
[14:21:58] <elukey>	 milimetric: weird... it didn't happen to me, not sure what happened before :(
[14:22:31] <elukey>	 ottomata: so mkfs.ext4 takes a -L label parameter, that can be used in fstab instead of the uuid (TIL)
[14:22:54] <elukey>	 so I am thinking to now add labels like "hadoop-$letter" to disk partitions
[14:23:10] <dsaez>	 elukey: http://pastie.org/p/56KTJEKydSQi4mx5uBYFAn I think is not working 
[14:23:16] <elukey>	 for swift the SRE team automates further and puppet takes care of formatting and mounting
[14:24:13] <dsaez>	 I'm just being able to see stat1007 sync with https://analytics.wikimedia.org/published/
[14:24:14] <ottomata>	 puppet formats? wow
[14:24:51] <ottomata>	 elukey:  what do yo think about using just one parameter with full paths?
[14:24:56] <elukey>	 ottomata: see swift::init_device
[14:25:14] <elukey>	 eventually we could move to a similar scheme
[14:25:36] <elukey>	 what label do you have in mind?
[14:25:40] <elukey>	 I am open to any suggestion
[14:26:33] <ottomata>	 elukey:  very cool!  i guess some swift admin did not want to deal with partman but still wanted to automate :)
[14:26:51] <elukey>	 ahahha yes
[14:26:54] <elukey>	 I created https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/629384
[14:26:58] <ottomata>	 elukey:  i don't care about the label, using 'b', 'c', or hdfs-1, hdfs-2 is fine with me too
[14:26:58] <elukey>	 with all the steps 
[14:27:16] <ottomata>	 i was just saying that perhaps we should have a single parameter to the class
[14:27:17] <ottomata>	 like
[14:27:19] <elukey>	 for the moment is hadoop-$letter
[14:27:23] <ottomata>	 that's fine
[14:27:38] <elukey>	 ahhh wait you are talking about the puppet change
[14:27:40] <elukey>	 okok sorry 
[14:27:42] <ottomata>	 datanode_mounts: ["/srv/data/hdfs/a", "/srv/data/hdfs/b", ...] (or whatever the path is)
[14:27:43] <ottomata>	 yes
[14:27:59] <ottomata>	 rather than two that try to put the labels together with a basedir
[14:27:59] <elukey>	 page fault, it took a bit
[14:28:02] <ottomata>	 haha
[14:28:22] <elukey>	 yes yes I am going to follow your suggestion, I like it
[14:31:03] <klausman>	 oh, fstab can use any device specifier mount itself can user LABEL= UUID= etc etc, as long as they are unique
[14:33:37] <milimetric>	 mforns: wanna pair on the event swap thing?
[14:33:45] <mforns>	 milimetric: sure!
[14:33:47] <mforns>	 bc?
[14:34:09] <milimetric>	 yes, omw uh... give me 1 min. gonna get some water
[14:35:12] <mforns>	 me too
[14:35:40] <elukey>	 klausman: yep we currently use UUID but LABEL is way nicer! :)
[14:41:22] <wikibugs>	 10Analytics-Radar, 10Better Use Of Data, 10Product-Infrastructure-Data, 10Product-Infrastructure-Team-Backlog, and 4 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (10LGoto)
[15:10:40] <elukey>	 joal: https://cwiki.apache.org/confluence/display/BIGTOP/Bigtop+1.4.0+Release - I see alluxio mentioned in there, just realized :)
[15:14:06] <klausman>	 1.9T of backup of 1006 done, only 1.7 more to go!
[15:36:36] <elukey>	 cookbook works! Just added all the 22 partitions on an-worker1096
[15:39:04] <elukey>	 dsaez: sorry I got distracted, I see test-diego in https://analytics.wikimedia.org/published/datasets/one-off/
[15:39:15] <elukey>	 did you rsync it from another node?
[15:54:19] <wikibugs>	 10Analytics-Radar, 10Better Use Of Data, 10Product-Infrastructure-Data, 10Wikimedia-Logstash, and 3 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (10LGoto)
[15:54:58] <nuria>	 a-team: i have meeting with  t & C today and need to miss standup 
[15:55:15] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Create a cookbook to automate the bootstrap of new Hadoop workers - https://phabricator.wikimedia.org/T262189 (10elukey) a:03elukey
[15:57:04] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Create a cookbook to automate the bootstrap of new Hadoop workers - https://phabricator.wikimedia.org/T262189 (10elukey) The cookbook now works, I was able to add all the partitions on an-worker1096->1101. In the current version I forgot to add the journalnode partitio...
[15:58:03] <joal>	 Hi folks - just joining back after kids 
[15:58:16] <joal>	 mforns: I bet you managed the job with klausman and elukey :)
[15:58:25] <joal>	 ottomata: Let'xs talk after standup?
[15:58:48] <mforns>	 joal: the zombie job? yes
[15:58:55] <joal>	 cool mforns 
[15:58:58] <joal>	 thanks for that :)
[15:59:06] <ottomata>	 joal:  ya, want to know if you have objections to https://phabricator.wikimedia.org/T263466
[15:59:09] <mforns>	 thank you for discovering and pinging!
[15:59:11] <ottomata>	 (am already implementing)
[15:59:15] <ottomata>	 https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/629406
[16:37:35] <dsaez>	 elukey, really ? I don't see ... weird
[16:37:49] <wikibugs>	 10Analytics, 10Machine Learning Platform (Research): Investigate formal test framework for Oozie jobs - https://phabricator.wikimedia.org/T213496 (10calbon) 05Open→03Declined
[16:37:51] <wikibugs>	 10Analytics-Radar, 10Dumps-Generation, 10Machine Learning Platform, 10ORES: Produce dump files for ORES scores - https://phabricator.wikimedia.org/T209739 (10calbon)
[16:38:41] <dsaez>	 elukey I just see folders and English Wikipedia Page Views by Topics.html	
[16:45:28] <milimetric>	 mforns: I'm gonna have some food and then keep patching up that data, I'll send an email or something with the result when it looks good
[16:45:41] <milimetric>	 (but I don't think I need to take up your time with it)
[16:45:57] <mforns>	 milimetric: ok! let me know if you need a second pair of eyes!
[17:03:44] <elukey>	 dsaez: mmm can you try to force a refresh of the page without cache?
[17:06:24] <dsaez>	 elukey: weird cache from my side, but you are right, from my phone I can see it. I've cleaned history and cookies on my browser and it is not refresing... anyhow, thank you, and sorry for bothering.
[17:16:26] <elukey>	 dsaez: nono don't worry, if you have troubles feel free to reach out, it might also be a varnish caching problem
[17:20:33] <mforns>	 elukey: in hadoop test cluster, what's the correct way to pull refinery code?
[17:20:43] <mforns>	 can we scap deploy there?
[17:20:47] <mforns>	 or just git pull
[17:22:14] <elukey>	 mforns: in theory scap should be able to deploy to analytics1030, the coordinator
[17:22:19] <elukey>	 git pull is also just fine
[17:22:34] <elukey>	 beware of the spam to analytics-alerts@
[17:22:39] <mforns>	 oh ok
[17:22:49] <elukey>	 I mean git grep / replace etc..
[17:23:03] <elukey>	 ottomata: 
[17:23:28] <mforns>	 elukey: you mean alerts when deploying or when running the oozie jobs?
[17:23:34] <elukey>	 in regex yaml, with your idea, I'll have to put the regex of all hadoop (non gpu) workers right?
[17:23:46] <elukey>	 mforns: nono when oozie jobs fail
[17:24:17] <mforns>	 ok ok
[17:41:18] <wikibugs>	 10Analytics, 10Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (10Ottomata)
[17:41:37] <wikibugs>	 10Analytics, 10Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (10Ottomata)
[17:41:39] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Better Use Of Data, and 6 others: Modern Event Platform: Schema Guidelines and Conventions - https://phabricator.wikimedia.org/T214093 (10Ottomata)
[17:41:42] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata)
[17:42:06] <wikibugs>	 10Analytics, 10Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (10Ottomata)
[17:43:09] <wikibugs>	 10Analytics, 10Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (10Ottomata)
[17:59:42] <wikibugs>	 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10jijiki) @Milimetric that would be great, if it is not too much work, I would appreciate it. I will work on the varnish...
[18:01:23] <wikibugs>	 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10jijiki)
[18:19:13] <milimetric>	 a-team: was trying to find archiva setup instructions for Razzi and ran into this empty doc (and I forgot how to do it): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva
[18:19:23] <milimetric>	 currently looking around but if anyone knows it off the top of their head
[18:20:01] <elukey>	 milimetric: what do you mean with setup instructions?
[18:20:32] <milimetric>	 I think this page has a bunch of it: https://wikitech.wikimedia.org/wiki/Archiva
[18:20:55] <milimetric>	 elukey: I mean like setting up the refinery-source repo to build properly (so it knows where our archiva is and all that)
[18:21:07] <elukey>	 ahhh from the dev point of view
[18:21:11] <wikibugs>	 10Analytics: Kerberos identity for razzi - https://phabricator.wikimedia.org/T263676 (10razzi)
[18:21:25] <wikibugs>	 10Analytics, 10Analytics-Kanban: Check whether mediawiki production event data is equivalent to mediawiki-history data over a month  - https://phabricator.wikimedia.org/T262261 (10Nuria)
[18:21:44] <nuria>	 milimetric: nothing shoudl be needed
[18:21:48] <joal>	 milimetric: refinery-source should build as-is using archiva (or at least I think it should?)
[18:21:57] <nuria>	 milimetric: (cc razzI) the pom.xml
[18:22:00] <milimetric>	 hm... I was just thinking... but it fails for Razzi with very obvious errors
[18:22:07] <nuria>	 has all info needed to pULL from archiva
[18:22:12] <milimetric>	 like "LogHelper" not found
[18:22:17] <joal>	 hm
[18:22:18] <milimetric>	 maybe gitfat isn't set up
[18:22:29] <joal>	 shouldn't be related
[18:22:37] <nuria>	 milimetric: the build fails?
[18:22:40] <joal>	 what is the maven command milimetric, razzi ?
[18:22:40] <nuria>	 cc razzi
[18:22:52] <milimetric>	 just mvn package fails (or mvn test or anything)
[18:22:58] <milimetric>	 so I think it's git fat, one sec let us set that up
[18:23:33] <joal>	 milimetric: git-fat is for refinery, not source IIUC
[18:24:12] <milimetric>	 that makes sense...
[18:24:13] <joal>	 milimetric, razzi - can you try with a clean before?
[18:24:19] <joal>	 mvn clean package
[18:24:20] <razzi>	 Trying now
[18:24:22] <milimetric>	 hm... so what's failing
[18:24:46] <joal>	 LogHelper is internal to refinery-source (in refinery-core)
[18:26:30] <milimetric>	 we have a bunch of confusing docs we should probably delete about deploying and updating .m2/settings for releases to archiva
[18:26:36] <milimetric>	 (we don't do that anymore right?)
[18:28:14] <joal>	 milimetric: we need that to deploy through manual maven deploy to archiva
[18:28:47] <milimetric>	 oh I thought we always did the jenkins thing
[18:29:04] <joal>	 milimetric: for refinery we do, for other repos not really
[18:31:27] <razzi>	 joal: ok cool `mvn clean package` worked
[18:31:31] <joal>	 \o/
[18:32:44] <joal>	 razzi, milimetric: if a previous build has failed, the state of failed modules can be wrong - I always use clean before building (except when recompiling after success)
[18:32:49] <elukey>	 going afk, o/
[18:32:53] <joal>	 bye elukey 
[18:33:08] <milimetric>	 maybe we should just add "clean" as a step to package and test?
[18:33:12] <milimetric>	 oh not test...
[18:33:14] <milimetric>	 right...
[18:33:19] <joal>	 ottomata: couldn't say in the metting but: I love your flying bikes :)
[18:34:03] <ottomata>	 haha
[18:34:07] <ottomata>	 you want them?
[18:34:14] <ottomata>	 joal actually, one of them is fabians! :p
[18:34:26] <joal>	 :)
[18:34:38] <joal>	 milimetric: feasible (http://maven.apache.org/plugins/maven-clean-plugin/usage.html)
[18:34:40] <ottomata>	 razzi: i just finished with meetings for the day!  let me know if there is anything i can help you with!
[18:34:57] <joal>	 ok meetings done for me as well - gone for tonight :)
[18:35:05] <milimetric>	 o/
[18:35:07] <ottomata>	 laters joal!
[18:36:28] <razzi>	 ottomata: I'm about to get lunch, then I'll be trying to experiment with oozie on the hadoop test cluster for https://phabricator.wikimedia.org/T262660. Think you could help with that?
[18:37:26] <wikibugs>	 10Analytics, 10Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (10Ottomata)
[18:37:39] <ottomata>	 ya for sure
[18:38:50] <razzi>	 Cool. Catch you in a bit
[18:38:53] <elukey>	 razzi: https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster#Map_of_the_testing_hosts to get a list of hosts and what they do
[18:39:02] <dsaez>	 elukey, thanks!
[18:39:46] <wikibugs>	 (03CR) 10Ottomata: [C: 04-1] "Given the discussion in https://phabricator.wikimedia.org/T263672 we should probably hold off on this, don't bother reviewing yet." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629406 (https://phabricator.wikimedia.org/T263466) (owner: 10Ottomata)
[18:51:57] <wikibugs>	 10Analytics-Radar, 10Machine Learning Platform, 10revscoring, 10artificial-intelligence: [Investigate] Hadoop integration for ORES training - https://phabricator.wikimedia.org/T170650 (10ACraze) 05Open→03Resolved
[18:56:44] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Privacy Engineering, 10Product-Analytics, and 3 others: Drop data from Prefupdate schema that is older than 90 days - https://phabricator.wikimedia.org/T250049 (10Milimetric) Ok, data is swapped.  @nettrom_WMF if you don't mind, could you check out both `event.PrefUpdate`...
[19:01:14] <wikibugs>	 10Analytics-Radar, 10Better Use Of Data, 10Product-Infrastructure-Data, 10Wikimedia-Logstash, and 4 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (10Ottomata) (Oops wrong Bug: # ^)
[19:02:12] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10User-jijiki: pageview=0 in X-Analytics supersedes anything else - https://phabricator.wikimedia.org/T263683 (10Milimetric)
[19:16:54] <mforns>	 razzi, I will take a pause in short and then be back to work for a bit more, do you want to pair again?
[19:18:46] <razzi>	 mforns: Yeah, ping me when you're back
[19:21:17] <mforns>	 ok!
[19:23:26] <milimetric>	 nuria: check out T263683, just added, should be quick and looks useful for a debug case that's coming up soon-ish
[19:23:27] <stashbot>	 T263683: pageview=0 in X-Analytics supersedes anything else - https://phabricator.wikimedia.org/T263683
[19:24:26] <nuria>	 milimetric: you can send pageview=preview
[19:24:30] <nuria>	 and that woudl happen
[19:24:36] <nuria>	 milimetric: it will not be counted
[19:24:39] <nuria>	 milimetric: right?
[19:25:11] <milimetric>	 I thought that... but then it would be kind of confusing, it's not really a preview, it should just be dropped entirely, I didn't want to overload that semantic
[19:25:40] <milimetric>	 (but code-wise, yeah, we can treat it the same way, basically add pageview=preview || pageview=0
[19:25:42] <milimetric>	 )
[19:25:53] <milimetric>	 or pageview=debug?
[19:25:56] <milimetric>	 maybe that's better
[19:26:07] <milimetric>	 nuria: ^
[19:29:07] <nuria>	 milimetric: my opinion would be that the least we modify that code for special casing the best so if we can use the existing preview that , i think, should be fine
[19:29:59] <milimetric>	 nuria: but then what if someone tries to analyze pageview=preview independent of our tooling, they wouldn't know about this case
[19:30:29] <milimetric>	 I am fairly strongly opposed to that overloading, in other words
[19:30:55] <milimetric>	 (just because I feel like I'd have no way to communicate it to a data analyst / researcher looking at that field)
[19:30:59] <nuria>	 milimetric: that is not a use case we have had in 5 years of adding that preview marker
[19:31:09] <nuria>	 milimetric: so i'd say it is pretty unlikey
[19:31:14] <nuria>	 * unlikely
[19:31:29] <milimetric>	 how do we know?  people can write arbitrary queries, maybe they've looked at requests that way
[19:32:08] <nuria>	 milimetric: no, i disagree that is realistic
[19:32:37] <nuria>	 milimetric: I am not sure if you are thinking that changes are only in the pageview definition
[19:32:45] <nuria>	 milimetric: they woudl need to happen on varnish as well
[19:32:47] <nuria>	 *would
[19:33:08] <milimetric>	 yeah, effie's doing those
[19:33:36] <nuria>	 milimetric: my thought on that is that we should not really have any code on varnish like that at all , the fact that it exists is a smell
[19:33:50] <nuria>	 milimetric: and we should not add to it 
[19:34:05] <nuria>	 milimetric:the cookie setting i get
[19:34:12] <nuria>	 milimetric: cause there is no other way 
[19:34:44] <nuria>	 milimetric: but tagging requests is probably something that - if possible- when we migrate to ats  we shoudl think of removing at some point
[19:34:54] <nuria>	 milimetric: we can consult the team and see what others think
[19:36:12] <milimetric>	 sure, I am happy to think of better ways, I'll tell effie to pause for a moment.  The problem is, basically, how do you allow external debug requests that would land in wmf_raw.webrequest and make sure they don't affect any of our analyses (I'll explain this in the task and we can triage it tomorrow)
[19:37:11] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10User-jijiki: pageview=0 in X-Analytics supersedes anything else - https://phabricator.wikimedia.org/T263683 (10Milimetric) @jijiki in discussing this with the team we want to brainstorm about it a bit.  Some think there might be a better way.  Give us until end of day tomor...
[19:39:29] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10Milimetric)
[19:42:55] <wikibugs>	 10Analytics: Kerberos identity for razzi - https://phabricator.wikimedia.org/T263676 (10razzi) 05Open→03Resolved a:03razzi Turns out I already had kerberos, I just forgot where I put my password :)
[19:51:54] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Ottomata) Hi @egardner! You've got everything right!  I'll respond to a...
[19:58:29] * ottomata  forgot to change my nick back oops
[19:58:46] <ottomata>	 milimetric:  just read that ticket, why are there varnish changes needed?
[19:59:11] <ottomata>	 couldn't effie just set that pageview=0 in the X-Analytics header when doing those types of tests?
[19:59:22] <ottomata>	 also...does the pageview pipeline respect that if it is set externally?
[19:59:43] <ottomata>	 maybe it reasons about pageview=1, but will it avoid tagging something that looks like a desktop pageview as a pageview if pageview=0?
[20:00:52] <milimetric>	 ottomata: I'm not sure why, but I'm fairly certain effie knows one way or the other, but yeah, someone could just set whatever they need in X-Analytics (not sure if the varnish code overwrites it, maybe that's the problem)
[20:01:25] <milimetric>	 as for the pageview definition, I've no idea what would happen if you sent pageview=0, but I think it would just ignore it and categorize based on the other data
[20:13:28] <mforns>	 hey razzi, I'm back :]
[20:13:39] <razzi>	 mforns: cool, cya in the batcave?
[20:13:43] <mforns>	 ok
[20:15:40] <razzi>	 ottomata: Feel free to join the batcave and talk oozie config with mforns and I
[20:20:56] <ottomata>	 oh be there in 2!
[20:25:12] <wikibugs>	 (03CR) 10Jenniferwang: "> Patch Set 3: Code-Review+1" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628235 (https://phabricator.wikimedia.org/T262499) (owner: 10Jenniferwang)
[21:36:13] <nuria>	 milimetric:, ottomata : varnish does not pass whatever is set in x-nalytics
[21:36:19] <nuria>	 *x-analytics
[21:37:10] <nuria>	 milimetric: if you send pageview=0  nothing happens
[21:37:17] <nuria>	 milimetric: it does not get passed on
[21:37:54] <nuria>	 milimetric: 
[21:37:57] <nuria>	 https://www.irccloud.com/pastebin/pDsuPqAy/
[21:41:48] <wikibugs>	 (03CR) 10Nuria: [V: 03+2 C: 03+2] Add SpecialMuteSubmit schema to EventLogging whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628235 (https://phabricator.wikimedia.org/T262499) (owner: 10Jenniferwang)
[21:50:11] <wikibugs>	 10Analytics, 10Event-Platform: Q2 goal. Deploy the canary event monitoring for some event streams - https://phabricator.wikimedia.org/T263696 (10Nuria)
[21:56:28] <wikibugs>	 10Analytics, 10Operations, 10serviceops-radar, 10Article-Recommendation, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Dzahn)
[22:00:05] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 (10Nuria)
[22:02:02] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 (10Nuria)
[22:48:27] <wikibugs>	 10Analytics, 10Analytics-Data-Quality: Import of MediaWiki tables into the Data Lakes mangles usernames - https://phabricator.wikimedia.org/T230915 (10Nuria) a:03lexnasser