[06:05:22] good morning!
[06:05:38] I am going to disable the timers as a prep step for the TLS maintenance
[06:06:22] !log stop timers on an-launcher1002 as prep step before maintenance
[06:06:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:55:35] this morning I had the idea of creating https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule
[06:55:59] to have a summary of maintenance schedules if needed
[07:10:28] Morning!
[07:10:37] About to start the backup of stat1006
[07:11:00] ack
[07:12:00] Hi elukey - I forgot yesterday that today is kids day - I'll follow from a distance
[07:12:06] Hi klausman
[07:14:34] elukey: the only problematic job I see for us is the wikitext-history one - you can stop/restart it, I'll cover for it manually once done
[07:24:34] joal: bonjour! I can postpone the maintenance to tomorrow if you wish
[07:25:05] there is also a big mjolnir job running
[07:29:27] all right, so I'll postpone to tomorrow
[07:29:40] !log re-enable timers on an-launcher1002 - maintenance postponed
[07:29:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:31:12] klausman: o/
[07:32:16] one thing to keep in mind for the next reimages - stat1004 was not rebooted (my bad) after the reimage, a step that is normally done by wmf-auto-reimage. There is an icinga alert about some kernel options not enabled for stat1004, we'll have to reboot it
[07:32:29] I am going to schedule the maintenance for Friday
[07:32:33] (sending an email now)
[07:36:36] also updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule
[07:38:00] Thanks!
[07:38:09] It looks like this backup is even slower than the last one :(
[07:38:32] for the first 20m or so, I'm calculating about 65MiB/s
[07:40:03] :(
[07:40:52] I think we should start re-thinking about using bacula to back up those home dirs
[07:41:23] there is an agreement with our users that those dirs are not backed up, due to their size etc., but I fear that a lot of people work as if they were
[07:41:37] (so they rely on us to avoid losing data)
[07:43:03] The problem is that a bunch of other tasks are also hitting the disk pretty hard
[07:43:14] yes, good point
[07:43:33] Including one by our very own mforns :)
[07:43:41] even if incremental, backups in theory should not be so heavy
[07:44:58] Oh, and one job of yours! :-P
[07:45:28] on what node?
[07:45:28] elukey 17473 0.4 0.8 15248908 545316 ? Sl Mar30 1111:31 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/lib/spark2/conf/:/usr/lib/spark2/jars/*:/etc/hadoop/conf/ -Xmx6g org.apache.spark.deploy.SparkSubmit --master local[4] --conf spark.driver.memory=6g pyspark-shell
[07:45:32] 1006
[07:45:49] well, that is an idle notebook, I am pretty sure it is not hitting the disk, no?
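A quick way to double-check whether that pyspark process is actually touching the disk, outside of htop - a minimal sketch, assuming iotop and sysstat happen to be installed on the stat host; the PID is the one from the ps paste above:

    # cumulative bytes read/written by the process since it started
    sudo cat /proc/17473/io
    # live per-process I/O, batch mode, three samples
    sudo iotop -b -o -p 17473 -n 3
    # kB read/written per second over three 5-second intervals (sysstat)
    pidstat -d -p 17473 5 3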
[07:45:57] 20M/s
[07:46:03] (according to htop)
[07:46:43] stopped it then, I am a bit confused
[07:47:19] just start htop as root on that machine, it's currently configured to sort by disk i/o
[07:47:54] nono, I trust you, I am confused about why spark needed that
[07:47:59] While the backup (tar) is the biggest I/O user, a bunch of other stuff is using the disk as well, in sum more than the tar
[07:48:15] Most of it is spark, but also python
[07:49:44] Analytics-Radar, Operations, Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (elukey) Stalled→Open Also opened https://github.com/cloudera/hue/pull/1271
[08:04:09] elukey: there are chances the spark wikitext job will not be finished tomorrow, so nevermind for that
[08:04:24] not killing mjolnir is nice though :)
[08:06:15] sigh
[08:08:13] * joal sends ops-love to elukey :
[08:11:58] elukey: there is something bizarre on stat1006
[08:15:12] How so?
[08:22:00] I've done some renicing/ionicing to help, but it's helping little, at best
[08:23:40] mforns: would you be nearby by any chance?
[08:27:52] Analytics, Product-Analytics, Structured Data Engineering, SDAW-MediaSearch (MediaSearch-Beta), Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (Cparle) Sounds good to me @egardner
[08:28:11] At this rate, the backup will take close to 17 hours, or until midnight UTC
[08:37:57] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=stat1006&var-datasource=thanos&var-cluster=analytics&viewPanel=31 This also tells me that the network is awfully congested
[08:45:49] Ok *something* changed
[08:46:03] tar now gets more than twice the amount of I/O bandwidth from disk
[08:47:06] klausman: A bunch of spark jobs got killed
[08:48:32] yeah, and restarted, we're back down to ~50MiB/s
[08:49:10] yup
[08:49:12] Ok, maybe 65MiB/s
[08:50:03] Also, 591 days of uptime *shudder*
[08:50:07] 491*
[08:53:45] klausman: I can't find an answer to the bandwidth usage :(
[08:54:34] I suspect the high retransmit rate (TCP) is indicative of network congestion as well.
[08:55:43] right - there are spark drivers (communicating with workers), and notebooks (communicating with clients) - I can't imagine how this could take 40Mb/s regularly :S
[08:59:32] klausman: I can assure you that we reboot when needed, probably that host didn't have important kernel upgrades to justify a maintenance :)
[09:00:12] I mean, they're also not easily reachable from outside, AFAICT
[09:00:31] Then again, we run workloads from a bunch of people
[09:01:27] there are also strict firewall rules to prevent random things downloaded via pip from opening ports and starting services etc..
[09:02:14] traffic towards production is filtered at the router level
[09:02:28] and there is no stat100x directly exposed to the internet
[09:03:06] the thing that I don't like a lot is that we don't have a clear view of the status of each venv/similar, security-wise
[09:03:26] I am pretty sure that in a lot of envs we'd need to pip upgrade packages
[09:05:26] Do we have any mechanism for draining machines? I.e. let all running jobs complete, but not allowing new ones?
[09:05:34] Because I feel that would have helped a lot here.
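An aside on the renicing/ionicing mentioned at 08:22 - a minimal sketch of deprioritizing one of the jobs competing with the backup for disk bandwidth; the PID is a placeholder, and whether the idle I/O class actually helps depends on the I/O scheduler in use on the host:

    PID=12345                    # placeholder: one of the competing spark/python jobs
    sudo renice -n 19 -p "$PID"  # lowest CPU priority
    sudo ionice -c 3 -p "$PID"   # idle I/O class: only gets disk time when nothing else wants it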
[09:06:07] klausman: we don't - yet
[09:10:08] we have always assumed that, in theory, any running job could be restarted if needed, but home dir usage has exploded since the last round of debian installs
[09:12:13] I am tempted to send a bunch of SIGSTOPs, get the backups done, then SIGCONT, but if any of them have network connections, that'll break 'em
[09:12:17] (also, kidding)
[09:14:54] one thing that could stop new spark conns/drivers would be to modify ferm rules (temporarily) to avoid opening ports for it
[09:15:21] (the spark shell has a range of ports allowed to be opened to communicate with the driver in Yarn)
[09:15:24] Would the other end then just error out?
[09:16:35] I think that the spark shell would fail when bootstrapping since it would fail to communicate with the Yarn app master / driver, but it is a failure that is a bit cryptic if you don't know about the ports etc..
[09:17:11] Would it retry?
[09:17:33] Also, how does discovery work?
[09:17:57] I feel it would be nicer to just have the driver not even know the host exists/accepts work
[09:18:36] 426GiB in 2h, and a total of 3.6T :-S
[09:18:48] the spark client basically starts from a known port (12000 IIRC) and adds +1 each time it finds it already bound
[09:19:01] we have 100 ports available in that range
[09:19:24] it was the compromise that we found to add firewall rules to stat100x while keeping spark working
[09:19:52] So the clients running on 1006 open a port, and then what? where does the work come from?
[09:20:20] if you specify spark2-shell --master yarn, the spark driver will run as the Yarn application master
[09:20:46] and the client on stat100x will basically be a lightweight process
[09:21:02] the driver in yarn can send info to the spark client on stat100x
[09:21:19] so the port needs to be open on the stat100x host
[09:21:55] ah yes, the function is called "Block Manager"
[09:22:52] How does yarn/the driver know about the stat hosts? static configuration?
[09:23:25] no, it is the host that requested the Yarn application in hadoop
[09:23:40] so to be precise, we have two ranges
[09:23:45] &R_SERVICE(tcp, 12000:12100, $ANALYTICS_NETWORKS);
[09:23:53] &R_SERVICE(tcp, 13000:13100, $ANALYTICS_NETWORKS);
[09:24:13] so the latter is the block manager, and I think it is used only when spark runs in local mode
[09:24:36] the former range is for spark.driver.port
[09:24:55] "Port for the driver to listen on. This is used for communicating with the executors and the standalone Master."
[09:25:25] I see. So if anything, we'd need to tell 100x to not run any more jobs
[09:25:41] so spark2-shell, if executed with --master yarn, asks the Yarn resource manager on an-master1001 to allocate an application master
[09:25:52] that will be the "remote" driver, doing the heavy lifting
[09:26:25] and then it will make sure that executors can/will be allocated on worker nodes via yarn etc..
[09:26:35] the client and the remote driver need to communicate
[09:26:58] now if we block those ports, spark2-shell --master yarn (new sessions) will probably fail
[09:27:30] I'd just prefer to not break anything already running
[09:27:55] I mean, we could sorta do that by allowing connections in ESTABLISHED state in netfilter, but it seems crude.
[09:28:21] those should already be left intact if we make the change via ferm
[09:28:54] the main issue is if an existing remote driver wants to open another tcp conn to the spark client for some reason
[09:28:58] in that case it'll break
[09:29:09] Yeah, I was thinking of that.
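For reference, a minimal sketch of the "crude" netfilter variant discussed above: reject new inbound connections to the Spark driver and block-manager port ranges while leaving already-established driver/client connections alone. The port ranges come from the ferm rules quoted at 09:23; applying raw iptables rules by hand on a ferm-managed host would only be a temporary measure, and it carries exactly the caveat elukey mentions - an existing remote driver opening a fresh connection would also be rejected:

    # reject only NEW connections to the spark driver / block manager ranges
    sudo iptables -I INPUT -p tcp --dport 12000:12100 -m conntrack --ctstate NEW -j REJECT
    sudo iptables -I INPUT -p tcp --dport 13000:13100 -m conntrack --ctstate NEW -j REJECT
    # roll back once the maintenance window is over
    sudo iptables -D INPUT -p tcp --dport 12000:12100 -m conntrack --ctstate NEW -j REJECT
    sudo iptables -D INPUT -p tcp --dport 13000:13100 -m conntrack --ctstate NEW -j REJECT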
[09:29:45] Besides cron and timers, these jobs are usually started by hand, right?
[09:30:30] or from jupyter notebooks
[09:30:42] but it should be almost the same
[09:33:13] I wonder if a drain could be accomplished by just turning off cron and timers, and restricting SSH login to our team
[09:35:45] could be an option, but the unavailability window would be longer
[09:52:32] Another option would be to announce that at time X, all still-running jobs will be canceled. So people can use the machine until the last minute, including gracefully stopping work if possible.
[09:53:11] Because I suspect a short but complete downtime is preferable to a long degradation in performance and potentially inconsistent backups.
[09:57:29] yes, this is a good point
[09:58:00] or we could ask SRE for a regular incremental backup via bacula of those home dirs, that would solve the problem in the long term
[09:58:24] but I don't think that we have room for this use case this fiscal year
[09:59:41] And I *do* wonder what that job of mforns is doing (PID 2993)
[10:00:15] Unrelatedly, lunch!
[10:12:14] Analytics-Clusters, Discovery, Discovery-Search (Current work), Patch-For-Review: Move mjolnir kafka daemon from ES to search-loader VMs - https://phabricator.wikimedia.org/T258245 (elukey) Stalled→Resolved a: elukey Closing it then, thanks a lot!
[10:34:18] * elukey lunch!
[11:54:13] hellooo
[12:02:32] 'lo
[12:12:40] elukey: I think the table for the maint schedule would be more useful with the date/time (in iso8601) as the first column, then the rest of the current columns in order.
[12:12:59] Or maybe in this format: 2020-09-23 09:00
[12:32:22] klausman: sure, feel free to modify ut
[12:32:24] *it
[12:41:17] will do
[12:46:30] and done
[12:50:09] changed the format in the () a tiny bit, but looks good
[13:08:01] Analytics, Product-Analytics, Structured Data Engineering, SDAW-MediaSearch (MediaSearch-Beta), Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (CBogen) @dcausse and @EBernhardson, just an FYI about @egardner's commen...
[13:12:48] wfm
[13:16:39] (PS1) Ottomata: [WIP] Use EventStreamConfig in CamusPartitionChecker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609)
[13:17:44] joal: yt?
[13:20:05] (CR) jerkins-bot: [V: -1] [WIP] Use EventStreamConfig in CamusPartitionChecker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[13:23:46] (CR) Jforrester: "recheck; sorry, had to restart CI." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[13:39:24] ahh elukey today the ops sync conflicts with the MEP sync
[13:39:25] hmmm
[13:48:48] Analytics-Radar, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (dcausse) a: dcausse
[13:50:36] ottomata: we can postpone, no problem
[14:05:45] a-team: one cp node in ulsfo is running varnish 6 and varnishkafka seems to be working fine
[14:06:33] sweet
[14:06:38] no-op upgrade, crazy!
[14:06:46] hi joal, just joined
[14:07:11] hi klausman, I saw I have a job running? I was not aware of it - on what machine?
[14:07:20] stat1006
[14:07:23] PID 2993
[14:07:30] i'm going to move the ops sync to tomorrow, same time
[14:08:12] klausman: looking
[14:08:40] (PS1) Ottomata: Spark JsonSchemaConverter - additionalProperties with schema is always a MapType [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629406 (https://phabricator.wikimedia.org/T263466)
[14:09:35] Started March 11th, well over 2000 CPU-hours accumulated
[14:12:17] hi all
[14:12:47] klausman: man, no idea what that is, it must be some short test I did in jupyter notebooks, and I failed to close it properly
[14:12:58] you want me to kill it?
[14:13:05] question, is this: https://wikitech.wikimedia.org/wiki/Analytics/Web_publication supposed to work on any stat machine? I've tried from stat1006-8 and it just worked for stat1007
[14:14:27] yep, it should
[14:21:28] elukey: I had to log in to hue-next again and it didn't do the double login thing
[14:21:58] milimetric: weird... it didn't happen to me, not sure what happened before :(
[14:22:31] ottomata: so mkfs.ext4 takes a -L label parameter, and that can be used in fstab instead of the uuid (TIL)
[14:22:54] so I am now thinking of adding labels like "hadoop-$letter" to the disk partitions
[14:23:10] elukey: http://pastie.org/p/56KTJEKydSQi4mx5uBYFAn I think this is not working
[14:23:16] for swift the SRE team automates further and puppet takes care of formatting and mounting
[14:24:13] I'm only able to see stat1007 syncing with https://analytics.wikimedia.org/published/
[14:24:14] puppet formats? wow
[14:24:51] elukey: what do you think about using just one parameter with full paths?
[14:24:56] ottomata: see swift::init_device
[14:25:14] eventually we could move to a similar scheme
[14:25:36] what label do you have in mind?
[14:25:40] I am open to any suggestion
[14:26:33] elukey: very cool! i guess some swift admin did not want to deal with partman but still wanted to automate :)
[14:26:51] ahahha yes
[14:26:54] I created https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/629384
[14:26:58] elukey: i don't care about the label, using 'b', 'c', or hdfs-1, hdfs-2 is fine with me too
[14:26:58] with all the steps
[14:27:16] i was just saying that perhaps we should have a single parameter to the class
[14:27:17] like
[14:27:19] for the moment it is hadoop-$letter
[14:27:23] that's fine
[14:27:38] ahhh wait, you are talking about the puppet change
[14:27:40] okok sorry
[14:27:42] datanode_mounts: ["/srv/data/hdfs/a", "/srv/data/hdfs/b", ...] (or whatever the path is)
[14:27:43] yes
[14:27:59] rather than two that try to put the labels together with a basedir
[14:27:59] page fault, it took a bit
[14:28:02] haha
[14:28:22] yes yes, I am going to follow your suggestion, I like it
[14:31:03] oh, fstab can use any device specifier that mount itself can use: LABEL=, UUID=, etc., as long as they are unique
[14:33:37] mforns: wanna pair on the event swap thing?
[14:33:45] milimetric: sure!
[14:33:47] bc?
[14:34:09] yes, omw uh... give me 1 min. gonna get some water
[14:35:12] me too
[14:35:40] klausman: yep, we currently use UUID but LABEL is way nicer! :)
[14:41:22] Analytics-Radar, Better Use Of Data, Product-Infrastructure-Data, Product-Infrastructure-Team-Backlog, and 4 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (LGoto)
[15:10:40] joal: https://cwiki.apache.org/confluence/display/BIGTOP/Bigtop+1.4.0+Release - I see alluxio mentioned in there, just realized :)
[15:14:06] 1.9T of the backup of 1006 done, only 1.7 more to go!
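A minimal sketch of the label-based approach discussed at 14:22-14:31 - formatting a datanode partition with a label and mounting it via LABEL= instead of a UUID; the device, label and mount point below are illustrative:

    sudo mkfs.ext4 -L hadoop-b /dev/sdb1
    sudo mkdir -p /srv/data/hdfs/b
    echo 'LABEL=hadoop-b /srv/data/hdfs/b ext4 defaults,noatime 0 2' | sudo tee -a /etc/fstab
    sudo mount /srv/data/hdfs/b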
[15:36:36] cookbook works! Just added all the 22 partitions on an-worker1096
[15:39:04] dsaez: sorry I got distracted, I see test-diego in https://analytics.wikimedia.org/published/datasets/one-off/
[15:39:15] did you rsync it from another node?
[15:54:19] Analytics-Radar, Better Use Of Data, Product-Infrastructure-Data, Wikimedia-Logstash, and 3 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (LGoto)
[15:54:58] a-team: i have a meeting with T&C today and need to miss standup
[15:55:15] Analytics-Clusters, Analytics-Kanban: Create a cookbook to automate the bootstrap of new Hadoop workers - https://phabricator.wikimedia.org/T262189 (elukey) a: elukey
[15:57:04] Analytics-Clusters, Analytics-Kanban: Create a cookbook to automate the bootstrap of new Hadoop workers - https://phabricator.wikimedia.org/T262189 (elukey) The cookbook now works, I was able to add all the partitions on an-worker1096->1101. In the current version I forgot to add the journalnode partitio...
[15:58:03] Hi folks - just joining back after kids
[15:58:16] mforns: I bet you managed the job with klausman and elukey :)
[15:58:25] ottomata: Let's talk after standup?
[15:58:48] joal: the zombie job? yes
[15:58:55] cool mforns
[15:58:58] thanks for that :)
[15:59:06] joal: ya, want to know if you have objections to https://phabricator.wikimedia.org/T263466
[15:59:09] thank you for discovering and pinging!
[15:59:11] (am already implementing)
[15:59:15] https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/629406
[16:37:35] elukey, really? I don't see... weird
[16:37:49] Analytics, Machine Learning Platform (Research): Investigate formal test framework for Oozie jobs - https://phabricator.wikimedia.org/T213496 (calbon) Open→Declined
[16:37:51] Analytics-Radar, Dumps-Generation, Machine Learning Platform, ORES: Produce dump files for ORES scores - https://phabricator.wikimedia.org/T209739 (calbon)
[16:38:41] elukey: I just see folders and English Wikipedia Page Views by Topics.html
[16:45:28] mforns: I'm gonna have some food and then keep patching up that data, I'll send an email or something with the result when it looks good
[16:45:41] (but I don't think I need to take up your time with it)
[16:45:57] milimetric: ok! let me know if you need a second pair of eyes!
[17:03:44] dsaez: mmm, can you try to force a refresh of the page without cache?
[17:06:24] elukey: weird cache on my side, but you are right, from my phone I can see it. I've cleaned history and cookies in my browser and it is not refreshing... anyhow, thank you, and sorry for bothering.
[17:16:26] dsaez: nono, don't worry, if you have troubles feel free to reach out, it might also be a varnish caching problem
[17:20:33] elukey: in the hadoop test cluster, what's the correct way to pull refinery code?
[17:20:43] can we scap deploy there?
[17:20:47] or just git pull
[17:22:14] mforns: in theory scap should be able to deploy to analytics1030, the coordinator
[17:22:19] git pull is also just fine
[17:22:34] beware of the spam to analytics-alerts@
[17:22:39] oh ok
[17:22:49] I mean git grep / replace etc..
[17:23:03] ottomata:
[17:23:28] elukey: you mean alerts when deploying or when running the oozie jobs?
[17:23:34] in the regex yaml, with your idea, I'll have to put the regex of all hadoop (non-gpu) workers, right?
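For the refinery question at 17:20, a sketch of the git-pull option on the test coordinator; the host name is the one mentioned above, while the deployment path is an assumption based on the usual scap layout:

    ssh analytics1030.eqiad.wmnet
    cd /srv/deployment/analytics/refinery   # assumed scap deploy target
    git log --oneline -1                    # note the currently deployed revision first
    git pull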
[17:23:46] mforns: nono, when oozie jobs fail
[17:24:17] ok ok
[17:41:18] Analytics, Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (Ottomata)
[17:41:37] Analytics, Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (Ottomata)
[17:41:39] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 6 others: Modern Event Platform: Schema Guidelines and Conventions - https://phabricator.wikimedia.org/T214093 (Ottomata)
[17:41:42] Analytics-EventLogging, Analytics-Kanban, Event-Platform, Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (Ottomata)
[17:42:06] Analytics, Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (Ottomata)
[17:43:09] Analytics, Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (Ottomata)
[17:59:42] Analytics-Radar, Release-Engineering-Team, observability, serviceops, User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (jijiki) @Milimetric that would be great, if it is not too much work, I would appreciate it. I will work on the varnish...
[18:01:23] Analytics-Radar, Release-Engineering-Team, observability, serviceops, User-jijiki: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (jijiki)
[18:19:13] a-team: was trying to find archiva setup instructions for Razzi and ran into this empty doc (and I forgot how to do it): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Archiva
[18:19:23] currently looking around, but wondering if anyone knows it off the top of their head
[18:20:01] milimetric: what do you mean by setup instructions?
[18:20:32] I think this page has a bunch of it: https://wikitech.wikimedia.org/wiki/Archiva
[18:20:55] elukey: I mean like setting up the refinery-source repo to build properly (so it knows where our archiva is and all that)
[18:21:07] ahhh, from the dev point of view
[18:21:11] Analytics: Kerberos identity for razzi - https://phabricator.wikimedia.org/T263676 (razzi)
[18:21:25] Analytics, Analytics-Kanban: Check whether mediawiki production event data is equivalent to mediawiki-history data over a month - https://phabricator.wikimedia.org/T262261 (Nuria)
[18:21:44] milimetric: nothing should be needed
[18:21:48] milimetric: refinery-source should build as-is using archiva (or at least I think it should?)
[18:21:57] milimetric: (cc razzi) the pom.xml
[18:22:00] hm... I was just thinking... but it fails for Razzi with very obvious errors
[18:22:07] has all the info needed to pull from archiva
[18:22:12] like "LogHelper" not found
[18:22:17] hm
[18:22:18] maybe git-fat isn't set up
[18:22:29] shouldn't be related
[18:22:37] milimetric: the build fails?
[18:22:40] what is the maven command, milimetric, razzi?
[18:22:40] cc razzi
[18:22:52] just mvn package fails (or mvn test or anything)
[18:22:58] so I think it's git fat, one sec, let us set that up
[18:23:33] milimetric: git-fat is for refinery, not source, IIUC
[18:24:12] that makes sense...
[18:24:13] milimetric, razzi - can you try with a clean before?
[18:24:19] mvn clean package
[18:24:20] Trying now
[18:24:22] hm... so what's failing
[18:24:46] LogHelper is internal to refinery-source (in refinery-core)
[18:26:30] we have a bunch of confusing docs we should probably delete about deploying and updating .m2/settings for releases to archiva
[18:26:36] (we don't do that anymore, right?)
[18:28:14] milimetric: we need that to deploy through a manual maven deploy to archiva
[18:28:47] oh, I thought we always did the jenkins thing
[18:29:04] milimetric: for refinery we do, for other repos not really
[18:31:27] joal: ok cool, `mvn clean package` worked
[18:31:31] \o/
[18:32:44] razzi, milimetric: if a previous build has failed, the state of failed modules can be wrong - I always use clean before building (except when recompiling after success)
[18:32:49] going afk, o/
[18:32:53] bye elukey
[18:33:08] maybe we should just add "clean" as a step to package and test?
[18:33:12] oh, not test...
[18:33:14] right...
[18:33:19] ottomata: couldn't say it in the meeting but: I love your flying bikes :)
[18:34:03] haha
[18:34:07] you want them?
[18:34:14] joal actually, one of them is fabian's! :p
[18:34:26] :)
[18:34:38] milimetric: feasible (http://maven.apache.org/plugins/maven-clean-plugin/usage.html)
[18:34:40] razzi: i just finished with meetings for the day! let me know if there is anything i can help you with!
[18:34:57] ok, meetings done for me as well - gone for tonight :)
[18:35:05] o/
[18:35:07] laters joal!
[18:36:28] ottomata: I'm about to get lunch, then I'll be trying to experiment with oozie on the hadoop test cluster for https://phabricator.wikimedia.org/T262660. Think you could help with that?
[18:37:26] Analytics, Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (Ottomata)
[18:37:39] ya, for sure
[18:38:50] Cool. Catch you in a bit
[18:38:53] razzi: https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster#Map_of_the_testing_hosts to get a list of hosts and what they do
[18:39:02] elukey, thanks!
[18:39:46] (CR) Ottomata: [C: -1] "Given the discussion in https://phabricator.wikimedia.org/T263672 we should probably hold off on this, don't bother reviewing yet." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629406 (https://phabricator.wikimedia.org/T263466) (owner: Ottomata)
[18:51:57] Analytics-Radar, Machine Learning Platform, revscoring, artificial-intelligence: [Investigate] Hadoop integration for ORES training - https://phabricator.wikimedia.org/T170650 (ACraze) Open→Resolved
[18:56:44] Analytics, Analytics-Kanban, Privacy Engineering, Product-Analytics, and 3 others: Drop data from Prefupdate schema that is older than 90 days - https://phabricator.wikimedia.org/T250049 (Milimetric) Ok, data is swapped. @nettrom_WMF if you don't mind, could you check out both `event.PrefUpdate`...
[19:01:14] Analytics-Radar, Better Use Of Data, Product-Infrastructure-Data, Wikimedia-Logstash, and 4 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (Ottomata) (Oops wrong Bug: # ^)
[19:02:12] Analytics, Analytics-Kanban, User-jijiki: pageview=0 in X-Analytics supersedes anything else - https://phabricator.wikimedia.org/T263683 (Milimetric)
[19:16:54] razzi, I will take a short pause and then be back to work for a bit more, do you want to pair again?
[19:18:46] mforns: Yeah, ping me when you're back
[19:21:17] ok!
[19:23:26] nuria: check out T263683, just added, should be quick and looks useful for a debug case that's coming up soon-ish
[19:23:27] T263683: pageview=0 in X-Analytics supersedes anything else - https://phabricator.wikimedia.org/T263683
[19:24:26] milimetric: you can send pageview=preview
[19:24:30] and that would happen
[19:24:36] milimetric: it will not be counted
[19:24:39] milimetric: right?
[19:25:11] I thought that... but then it would be kind of confusing, it's not really a preview, it should just be dropped entirely, I didn't want to overload that semantic
[19:25:40] (but code-wise, yeah, we can treat it the same way, basically add pageview=preview || pageview=0
[19:25:42] )
[19:25:53] or pageview=debug?
[19:25:56] maybe that's better
[19:26:07] nuria: ^
[19:29:07] milimetric: my opinion would be that the less we modify that code for special-casing the better, so if we can use the existing preview, that, i think, should be fine
[19:29:59] nuria: but then what if someone tries to analyze pageview=preview independently of our tooling? they wouldn't know about this case
[19:30:29] I am fairly strongly opposed to that overloading, in other words
[19:30:55] (just because I feel like I'd have no way to communicate it to a data analyst / researcher looking at that field)
[19:30:59] milimetric: that is not a use case we have had in 5 years of adding that preview marker
[19:31:09] milimetric: so i'd say it is pretty unlikey
[19:31:14] * unlikely
[19:31:29] how do we know? people can write arbitrary queries, maybe they've looked at requests that way
[19:32:08] milimetric: no, i disagree that that is realistic
[19:32:37] milimetric: I am not sure if you are thinking that changes are only in the pageview definition
[19:32:45] milimetric: they woudl need to happen on varnish as well
[19:32:47] *would
[19:33:08] yeah, effie's doing those
[19:33:36] milimetric: my thought on that is that we should not really have any code on varnish like that at all, the fact that it exists is a smell
[19:33:50] milimetric: and we should not add to it
[19:34:05] milimetric: the cookie setting i get
[19:34:12] milimetric: cause there is no other way
[19:34:44] milimetric: but tagging requests is probably something that - if possible - we should think of removing at some point when we migrate to ats
[19:34:54] milimetric: we can consult the team and see what others think
[19:36:12] sure, I am happy to think of better ways, I'll tell effie to pause for a moment. The problem is, basically, how do you allow external debug requests that would land in wmf_raw.webrequest and make sure they don't affect any of our analyses (I'll explain this in the task and we can triage it tomorrow)
[19:37:11] Analytics, Analytics-Kanban, User-jijiki: pageview=0 in X-Analytics supersedes anything else - https://phabricator.wikimedia.org/T263683 (Milimetric) @jijiki in discussing this with the team we want to brainstorm about it a bit. Some think there might be a better way. Give us until end of day tomorrow...
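For context on the debug-tagging discussion above, a hypothetical sketch of what an externally tagged test request could look like; note that, as nuria points out later in the log (21:36), a client-supplied X-Analytics value is currently not passed through by varnish, so today this would be a no-op - any real mechanism would need the varnish/VCL change being discussed:

    # hypothetical: tag a test request so it could later be filtered out of analyses
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'X-Analytics: pageview=0' \
        'https://en.wikipedia.org/wiki/Special:BlankPage'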
[19:39:29] Analytics, Analytics-Kanban, User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (Milimetric)
[19:42:55] Analytics: Kerberos identity for razzi - https://phabricator.wikimedia.org/T263676 (razzi) Open→Resolved a: razzi Turns out I already had kerberos, I just forgot where I put my password :)
[19:51:54] Analytics, Product-Analytics, Structured Data Engineering, SDAW-MediaSearch (MediaSearch-Beta), Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (Ottomata) Hi @egardner! You've got everything right! I'll respond to a...
[19:58:29] * ottomata forgot to change my nick back, oops
[19:58:46] milimetric: just read that ticket, why are varnish changes needed?
[19:59:11] couldn't effie just set that pageview=0 in the X-Analytics header when doing those types of tests?
[19:59:22] also... does the pageview pipeline respect that if it is set externally?
[19:59:43] maybe it reasons about pageview=1, but will it avoid tagging something that looks like a desktop pageview as a pageview if pageview=0?
[20:00:52] ottomata: I'm not sure why, but I'm fairly certain effie knows one way or the other; but yeah, someone could just set whatever they need in X-Analytics (not sure if the varnish code overwrites it, maybe that's the problem)
[20:01:25] as for the pageview definition, I've no idea what would happen if you sent pageview=0, but I think it would just ignore it and categorize based on the other data
[20:13:28] hey razzi, I'm back :]
[20:13:39] mforns: cool, cya in the batcave?
[20:13:43] ok
[20:15:40] ottomata: Feel free to join the batcave and talk oozie config with mforns and me
[20:20:56] oh, be there in 2!
[20:25:12] (CR) Jenniferwang: "> Patch Set 3: Code-Review+1" [analytics/refinery] - https://gerrit.wikimedia.org/r/628235 (https://phabricator.wikimedia.org/T262499) (owner: Jenniferwang)
[21:36:13] milimetric, ottomata: varnish does not pass whatever is set in x-nalytics
[21:36:19] *x-analytics
[21:37:10] milimetric: if you send pageview=0 nothing happens
[21:37:17] milimetric: it does not get passed on
[21:37:54] milimetric:
[21:37:57] https://www.irccloud.com/pastebin/pDsuPqAy/
[21:41:48] (CR) Nuria: [V: +2 C: +2] Add SpecialMuteSubmit schema to EventLogging whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/628235 (https://phabricator.wikimedia.org/T262499) (owner: Jenniferwang)
[21:50:11] Analytics, Event-Platform: Q2 goal. Deploy the canary event monitoring for some event streams - https://phabricator.wikimedia.org/T263696 (Nuria)
[21:56:28] Analytics, Operations, serviceops-radar, Article-Recommendation, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Dzahn)
[22:00:05] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 (Nuria)
[22:02:02] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 (Nuria)
[22:48:27] Analytics, Analytics-Data-Quality: Import of MediaWiki tables into the Data Lakes mangles usernames - https://phabricator.wikimedia.org/T230915 (Nuria) a: lexnasser