[00:11:12] <Jdlrobson>	 hey sorry, not been feeling too great today so ive been a bit on/off
[00:11:25] <Jdlrobson>	 im not working on vagrant no
[00:59:52] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Wikistats Bug - inaccurate lists for top editors - https://phabricator.wikimedia.org/T258233 (10Quiddity) 05Open→03Invalid Confirmed! My mistake. Sorry I overlooked that (now obvious!) detail.
[01:10:49] <wikibugs>	 10Analytics, 10Analytics-EventLogging: Events being lost in Chrome when navigating to an external URL - https://phabricator.wikimedia.org/T258513 (10Nuria) Ok, i did tested with netcat in vagrant and I see the event being sent on this case:   `  $('<a>').attr('href', 'https://wikimedia.org').on( 'click', funct...
[06:36:34] <wikibugs>	 10Analytics-Clusters, 10Operations, 10ops-eqiad, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) >>! In T236327#6324507, @Jclark-ctr wrote: > @elukey   can you let me know your availability for  scheduling this project?  Any time that you are...
[08:09:15] <elukey>	 !log turnilo.wikimedia.org migrated to CAS
[08:09:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:22:54] <wikibugs>	 10Analytics-Clusters, 10Discovery: Move mjolnir kafka daemon from ES to search-loader VMs - https://phabricator.wikimedia.org/T258245 (10elukey) @EBernhardson @RKemper if you have time this/next week do you think that we could prioritize this task? I am asking since I'd love to add ferm rules to Kafka Jumbo as...
[09:44:59] <wikibugs>	 (03CR) 10Joal: "sorry for that :( Thanks a lot mforns for the fix" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/615274 (owner: 10Mforns)
[10:00:05] <wikibugs>	 10Analytics: Data Lake incremental Data Updates - https://phabricator.wikimedia.org/T258511 (10JAllemandou)
[10:29:45] * elukey lunch!
[12:48:41] <ottomata_>	 dunno why i can't identify as ottomata
[12:48:55] <ottomata_>	 joal: https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/615415
[12:48:56] <ottomata_>	 !
[12:48:58] <ottomata_>	 pretty cool!
[12:49:11] <ottomata_>	 can you check that and make sure the revert field names make sense?
[12:52:21] <RhinosF1>	 ottomata_: /ns regain ottomata
[12:52:41] <ottomata>	 RhinosF1:  that's it?!
[12:52:43] <ottomata>	 weird
[12:53:36] <RhinosF1>	 ottomata: if /nick ottomata fails, /ns regain will work as it kicks any other connections off it.
[12:53:50] <RhinosF1>	 And removes any hold by NickServ
[12:53:57] <ottomata>	 can anyone just do that?
[12:54:40] <elukey>	 in theory you need to put you credentials when doing it
[12:54:52] <RhinosF1>	 ottomata: only when idenitified to your account, which you was
[12:55:18] <ottomata>	 huh intersting, ok thank you!
[12:55:47] <RhinosF1>	 No problem!
[12:56:56] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Refine - Quote SQL columns used in selectExpr in TransformFunctions [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615231 (https://phabricator.wikimedia.org/T255818) (owner: 10Ottomata)
[13:03:16] <wikibugs>	 10Analytics, 10Analytics-EventLogging: Events being lost in Chrome when navigating to an external URL - https://phabricator.wikimedia.org/T258513 (10jlinehan) >>! In T258513#6324815, @Nuria wrote: > Now, probably someone should also confirm my findings to make sure I am not totally off.  Yes, good, good, was a...
[13:16:07] <joal>	 ottomata: i added 2 comments - thanks for linking me in
[13:20:00] <wikibugs>	 (03PS1) 10Gehel: Introduce Takari Maven Wrapper. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615481
[13:23:12] <wikibugs>	 (03PS1) 10Gehel: Introduce Maven sortpom plugin. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615482
[13:25:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Introduce Maven sortpom plugin. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615482 (owner: 10Gehel)
[13:30:44] <wikibugs>	 (03PS1) 10Gehel: Use properties to configure compiler source and target versions. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615485
[13:31:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Use properties to configure compiler source and target versions. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615485 (owner: 10Gehel)
[13:33:00] <wikibugs>	 (03CR) 10Gehel: Use properties to configure compiler source and target versions. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615485 (owner: 10Gehel)
[13:33:31] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "Looks like there are some strange depencency ordering constraints. This smells like jar hell!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615482 (owner: 10Gehel)
[13:36:51] <elukey>	 gehel: o/ - thanks a lot for the refinery patches, but be aware that we'll probably not merge anything until we have the whole team back from the summer holidays (to avoid weird issues while in reduced capacity :( )
[13:37:12] <gehel>	 elukey: sure, I'm just having fun :)
[13:37:43] <elukey>	 gehel: please keep going :) it seems worth to be tracked in a task
[13:38:50] <elukey>	 our pom.xmls surely need some love
[13:44:25] <wikibugs>	 10Analytics, 10Event-Platform, 10Product-Analytics (Kanban): Product Analytics to review & provide feedback for Event Platform Instrumentation How-To - https://phabricator.wikimedia.org/T253269 (10mpopov)
[13:44:29] <wikibugs>	 10Analytics, 10Event-Platform, 10Product-Analytics (Kanban): Product Analytics to review & provide feedback for Event Platform Instrumentation How-To - https://phabricator.wikimedia.org/T253269 (10mpopov)
[13:50:26] <wikibugs>	 10Analytics, 10Event-Platform, 10Product-Analytics (Kanban): Product Analytics to review & provide feedback for Event Platform Instrumentation How-To - https://phabricator.wikimedia.org/T253269 (10mpopov) Have pinged everyone else about the status of their review, if they still want to review that how-to. Mi...
[14:03:24] <mforns>	 joal: if you're there: if we don't use the cassandra bundle, could we drop its code?
[14:04:40] <joal>	 mforns: if you want :)
[14:05:08] <mforns>	 joal: yes? :D do I have your blessing?
[14:06:13] <joal>	 mforns:yes yes! the constrqint of having to restart it on 1st of month is too cumbersome
[14:06:43] <mforns>	 ok, will do in a separate change. otherwise I have to add it for the new geoeditors job
[14:12:26] <ottomata>	 gehel:  we still want to do  that shaded jar change you recommened
[14:12:29] <ottomata>	 we just haven't because we are lazy
[14:15:30] <milimetric>	 joal: lol, I'm so sad about https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/615415/
[14:15:58] <milimetric>	 they did all our work, we should find something else to play with incrementals on
[14:16:01] <joal>	 milimetric: how so? this is actually great!
[14:16:11] <joal>	 hehehe :)
[14:16:14] <milimetric>	 :)
[14:16:21] <joal>	 milimetric: not yet as you might have noticed
[14:16:39] <milimetric>	 well, in a week the other patch lands which gets us the set of reverted edits
[14:16:53] <milimetric>	 and we'll have to work out the duration thing but that seems feasible too
[14:17:24] <milimetric>	 so that's pretty much everything that would've been hard to update.  I suppose we still have to propagate the updates to the old data, but computation is mostly done by mediawiki now
[14:17:35] <joal>	 we need/wish to update reverted revisions - inside-page revision match and update - very good test for the data-mutation aspect of the incremental backend
[14:17:50] <milimetric>	 yeah, the data mutation stays
[14:17:53] <milimetric>	 but no computation
[14:18:29] <joal>	 milimetric: which in any case woumd be the most complicated aspect (we already have the algo that does revert match/check)
[14:19:10] <milimetric>	 well, the extra awesome here is that they're giving us non-exact reverts!
[14:19:30] <joal>	 milimetric: IMO we'd still need to apply an algo that would look like the current version of it - The only changes are the checks for shas not needed anymore, and actually more value being brought by reverts being not identity-only
[14:19:35] <milimetric>	 but yes, otherwise we have the algorithm, I just wasn't sure about the scale and performance of that part
[14:19:36] <joal>	 ideed :)
[14:19:58] <milimetric>	 (because we currently do tricks like partitioning by year, so I didn't know whether we can compute as we see each event)
[14:19:59] <joal>	 milimetric: this concurrent writing of the same ideas means we are on the same page I guess :)
[14:20:21] <joal>	 milimetric: parttition by year is gone actually!
[14:20:27] <milimetric>	 oh?!
[14:21:07] <joal>	 milimetric: we use secondary-sorting, meaning sorting with disk-spill, therefore no need to split by year
[14:21:25] <milimetric>	 ah, very cool, didn't realize that (or forgot if I did at some point)
[14:21:31] <milimetric>	 thanks!
[14:26:33] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[14:27:31] <wikibugs>	 (03PS1) 10Mforns: Remove Cassandra bundle since not used [analytics/refinery] - 10https://gerrit.wikimedia.org/r/615499 (https://phabricator.wikimedia.org/T248289)
[14:35:52] <elukey>	 https://grafana.wikimedia.org/d/000000585/hadoop?panelId=25&fullscreen&orgId=1&from=now-30d&to=now
[14:36:00] <elukey>	 we are crossing again the 2PB mark on hdfs
[14:36:25] <elukey>	 https://grafana.wikimedia.org/d/000000585/hadoop?panelId=103&fullscreen&orgId=1&from=now-24h&to=now
[14:36:29] <wikibugs>	 (03PS6) 10Mforns: Configure Oozie job for loading geoeditors data into Cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/582638 (https://phabricator.wikimedia.org/T248289) (owner: 10Lex Nasser)
[14:36:51] <dsaez>	 hey a-team, do you have any pointer to see how to load the Wikitext avro files? hive is taking too long. I've already loaded the jars. But then I don't know how to do the
[14:36:51] <dsaez>	 spark.read. ... ? I've seen examples using .format('databricks.avro') ..but it doesn't work. And I couldn't find any example nor wikitech or your gists ...
[14:37:37] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Radar, 10Discovery: SearchSatisfaction has validation errors for event.query - https://phabricator.wikimedia.org/T257331 (10Milimetric)
[14:37:41] <joal>	 Hi dsaez
[14:37:52] <dsaez>	 hi joal!
[14:39:18] <wikibugs>	 (03PS7) 10Mforns: Configure Oozie job for loading geoeditors data into Cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/582638 (https://phabricator.wikimedia.org/T248289) (owner: 10Lex Nasser)
[14:39:26] <joal>	 dsaez: When you say hive is taking too long, let's discuss the analysis you're doingto see if spark would really speed up things
[14:39:59] <wikibugs>	 (03PS8) 10Mforns: Configure Oozie job for loading geoeditors data into Cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/582638 (https://phabricator.wikimedia.org/T248289) (owner: 10Lex Nasser)
[14:40:06] <joal>	 dsaez: I can also very much exwplain how to setup a specific spark kernel that includes jars allowing you to read avro
[14:40:24] <dsaez>	 joal, a simple string matching, on the wikitext_history , just for 'enwiki' 
[14:40:47] <elukey>	 so /user/analytics/.Trash/200714000000/wmf/data/wmf/pageview/actor is 18T (replicated) on HDFS, can we drop it safely?
[14:40:56] <elukey>	 (trying to free the usual suspects)
[14:41:45] <dsaez>	 joal: https://paste.ofcode.org/SNC6t2SWvY94wpA4Ej2NmH
[14:42:25] <dsaez>	 that is taking super long, and I needed to increase the driver, overhead and workers memory a lot 
[14:43:42] <elukey>	 17.7 T  53.0 T  /var/log/hadoop-yarn
[14:43:45] <elukey>	 also interesting
[14:43:46] <joal>	 dsaez: I'd suggest rewriting the filter using a RLIKE with a case-insensitive flag
[14:43:59] <joal>	 dsaez: computing lower on all texts is big
[14:44:38] <joal>	 dsaez: then I'm not surprised this is long - parsing all text is long :)
[14:44:53] <dsaez>	 oh, I see, RLIKE was also sending a overhead error
[14:44:58] <joal>	 dsaez: I assume it's important you look at historical qnd not only xcurrent?
[14:45:04] <dsaez>	 exactly.
[14:45:10] <joal>	 hm
[14:45:12] <dsaez>	 because after that I also do: https://paste.ofcode.org/uRVHxgQnVXBpdcJRHvvpMP
[14:45:24] <dsaez>	 and that is also taking a lot. And that is a simple join.
[14:45:45] <elukey>	 joal: is there anything under /wmf that is droppable? I recall the last time that you brought up some snapshots that could have been removed
[14:46:46] <dsaez>	 btw, I now the solution could be to use what you explained me in the last meeting, that special dataset with the parallel data with parent_id, but I thougth this should be so expensive
[14:48:36] <joal>	 checking elukey 
[14:49:05] <wikibugs>	 10Analytics-Radar, 10Operations, 10Patch-For-Review: Move yarn.wikimedia.org to a separate Buster VM - https://phabricator.wikimedia.org/T258152 (10MoritzMuehlenhoff) 05Open→03Resolved Yarn us now running on a separate Ganeti VM using Buster (an-tool1008.eqiad.wmnet)
[14:49:16] <elukey>	 !log hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics/logs/*
[14:49:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:49:33] <joal>	 dsaez: that join is extremely costly
[14:51:14] <joal>	 elukey: we could drop one wikitext history snapshot
[14:51:41] <elukey>	 joal: if not a problem please do :)
[14:53:30] <joal>	 dsaez: can you please move to using snapshot 2020-04? I'm gonna drop the previous one to free space
[14:53:55] <dsaez>	 joal: sure. 
[14:54:22] <elukey>	 dsaez: all the blame to the ops people, always complaining and asking to drop data :D
[14:54:29] <dsaez>	 joal: any suggestion in how to make that join ligheter and more important how to evaluate the join cost?
[14:55:12] <dsaez>	 elukey haha... no no... in revenge I came and ask silly questions to all of you, which I suppose to know the answer
[14:58:54] <dsaez>	 elukey, btw, have you seen this https://phabricator.wikimedia.org/T258087 ? we have already tried to by reseting the 'venv', deleting .local, but without success 
[14:59:58] <elukey>	 dsaez: ah snap please either use the Analytics tag or Analytics-clusters, otherwise we don't get it in our board :(
[15:00:44] <milimetric>	 a-team: standup?
[15:00:48] <elukey>	 dsaez: are you folks able to reach the login, and see the jupyter interface to launch kernels etc.. ?
[15:00:49] <joal>	 Joining!
[15:00:52] <wikibugs>	 10Analytics-Clusters, 10Jupyter-Hub: Timeout during relaunch Jupyterhub server - https://phabricator.wikimedia.org/T258087 (10diego)
[15:00:54] <elukey>	 milimetric: yep!
[15:01:08] <milimetric>	 am I in an alternate universe or something?
[15:01:13] <jayme>	 ottomata: could you help out to verify if eventstreams in codfw is working? It looks as if the clients don't come back https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventstreams
[15:01:20] <elukey>	 !log hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics-privatedata/logs
[15:01:21] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:01:48] <dsaez>	 elukey: it is just in rodolfo's (intern) account, but, no, he just is able to loggin and gets that error
[15:02:03] <elukey>	 interesting, will check after meetings
[15:03:08] <joal>	 !log Manually drop /wmf/data/wmf/mediawiki/wikitext/history/snapshot=2020-03 to free some spqce
[15:03:09] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:03:18] <jayme>	 ottomata: as I said it, now they're coming back. Thanks :D
[15:05:08] <joal>	 elukey: dropping the old pageviez-actor
[15:05:08] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:05:50] <joal>	 !log manually drop /user/analytics/.Trash/200714000000/wmf/data/wmf/pageview/actor to free some space
[15:05:52] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:06:17] <dsaez>	 thx elukey!
[15:06:26] <elukey>	 joal: thanks!
[15:06:32] <joal>	 dsaez: I'll try to explain why the join is costly just after standup :)
[15:06:58] <ottomata>	 jayme:  aye the connections are long lived,  but ats is timing them out after 15  minutes
[15:07:20] <joal>	 dsaez: basically you're getting a bunch of revs through your filter (that's already expensive, since it needs to read a lot of text) - Then you join to the all rev-text 
[15:07:26] <ottomata>	 so it  probably will take that long for them to be rebalanced if e.g codfw was depooled and repooled
[15:07:50] <joal>	 that last bit means shuffling all rev-text, therefore duplicating the content
[15:08:45] <joal>	 if you expect the number of filtered revs to be relatively small, you could generate a filter based on that instead of doing a join
[15:08:51] <joal>	 dsaez: --^
[15:09:13] <jayme>	 ottomata: okay, thanks. I was doing a deployment (for envoy upgrade) in codfw and will do so in eqiad as well
[15:09:37] <dsaez>	 joal: there are 5K ids. By filter you mean a WHERE or something like that?
[15:10:37] <dsaez>	 the part that I don't get is when the job is done just in the column, and when it involves the full row, I thought that the join above was just working on the revision_id col
[15:11:37] <dsaez>	 joal: forget about the filter question, I'll check that in Google
[15:12:03] <joal>	 dsaez: indeed, i mean generating a where clause with the (k ids
[15:12:07] <joal>	 5k sorry
[15:12:27] <joal>	 By doinf that, no suffle is needed, therefore no data duplication
[15:12:34] <joal>	 dsaez: --^
[15:14:14] <elukey>	 joal: between me and you we dropped 5M files :O
[15:14:32] <joal>	 elukey: that's a lot
[15:14:33] <elukey>	 that is a ton for the HDFS namenode's heap
[15:15:22] <dsaez>	 I see, I'll research more about that shuffling and data duplication. I remember that I use to have similar problem with the  ReduceByKey (or something like )in Spark 1.X
[15:16:13] <elukey>	 joal: https://grafana.wikimedia.org/d/000000585/hadoop?panelId=92&fullscreen&orgId=1&from=now-3h&to=now
[15:16:24] <joal>	 I waslooking at that :)
[15:16:33] <elukey>	 :D
[15:20:26] * dsaez just discovered grafana
[15:26:02] <dsaez>	 joal: your suggestion complete solved the problem 5 mins to run the filter, against 12 hours for the join. Mental note: avoid join
[15:27:39] <joal>	 dsaez: let's discuss the assumption here :)
[15:28:26] <joal>	 dsaez: joins are costly for big stuff as data needs to be shuffles (meaning duplicated) - for smallish stuff, no need to avoid them
[15:28:58] <joal>	 for instance, using ids only on columnar storage, joining should be ok
[15:29:24] <joal>	 however, the join here is containing text, making shuffled data huge
[15:29:56] <joal>	 dsaez: something else to keep in mind: filters work up to a certain number of rows - let's say ten's of thousand, max
[15:30:18] <joal>	 if you have more rows to filter, I'm afraid you'll have to go for joins
[15:30:30] <joal>	 dsaez: --^
[15:30:46] <joal>	 dsaez: I'm happy nonetheless your problem got solved :)
[15:30:54] <dsaez>	 joal: got it.  Thx. 
[15:31:25] <dsaez>	 joal: And about spark vs hive. When worth to write spark vs hive?
[15:32:04] <joal>	 dsaez: in most cases when the jobs have multiple steps (as here: filter + join, or filter + collect + filter)
[15:32:20] <joal>	 When the job has a single step, both should be similar
[15:32:39] <joal>	 dsaez: I still recommend using spark in all cases (easier to program, etc)
[15:36:56] <dsaez>	 joal: my usual work flow is A = spark.sql( Select something) ; B= spark.sql(Select anothething), and then work with spark commands later. For example A.join(B). There is any adventage or difference in instead of doing that, just write the full spark.sql(SELECT ... JOIN ...) and so on?
[15:37:35] <joal>	 dsaez: both ways are the same in term of spark execution
[15:37:39] <dsaez>	 and second question. There is any difference in gathering the data with spark.sql, instead of doing spark.read.parquet (or avro)... ?
[15:38:04] <joal>	 some gain can be made using cache if for instance you do some analysis on A before doing the join (and data is not big)
[15:38:37] <joal>	 dsaez: second question: no diff as long as partitions are setup the same way
[15:39:17] <dsaez>	 joal, got it
[15:39:33] <joal>	 for instance dsaez: A = spqrk.sql(...).cache  -- A.count() -- A.distinct.count() etc, then B and join --> A being cahced helps
[15:42:11] <dsaez>	 oh, I didn't know that cache, is the same that checkpoint?
[15:42:31] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Release-Engineering-Team, 10dev-images, 10Patch-For-Review: EventLogging dev image should have verbose output enabled - https://phabricator.wikimedia.org/T257378 (10Milimetric) 05Declined→03Open p:05Triage→03High my bad, we were speed-grooming.  This is ac...
[15:43:50] <wikibugs>	 10Analytics, 10Analytics-EventLogging: Events being lost in Chrome when navigating to an external URL - https://phabricator.wikimedia.org/T258513 (10Nuria) @Jdlrobson Please look at comment above: {T258513#6324815}. The event is being sent, but is is not logged in chrome network tab * in the case of the cross...
[15:54:50] <wikibugs>	 10Analytics, 10Analytics-EventLogging: Events being lost in Chrome when navigating to an external URL - https://phabricator.wikimedia.org/T258513 (10Jdlrobson) That's reassuring to know. My understanding was @Mayakp.wiki was saying that she wasn't seeing events in the database itself that should have been sent...
[15:58:37] <joal>	 dsaez: cache is asking spark to keep stuff in ram, allowing for faster queries when dataframes are reused
[15:59:05] <dsaez>	 joal: got it, checkpoint saves on HDD, true?
[16:00:17] <wikibugs>	 10Analytics, 10Analytics-EventLogging: Events being lost in Chrome when navigating to an external URL - https://phabricator.wikimedia.org/T258513 (10Nuria) 05Open→03Resolved
[16:02:18] <wikibugs>	 10Analytics, 10Analytics-EventLogging: Events being lost in Chrome when navigating to an external URL - https://phabricator.wikimedia.org/T258513 (10Nuria) @Jdlrobson Please do test overriding url and seeing whether event gets sent to a local port and let us know what you find either here or in irc (once you a...
[16:05:25] <joal>	 dsaez: yes
[16:05:51] <joal>	 ok team - gone for now - See you in a couple weeks :)
[16:06:04] <elukey>	 joal: o/ - can I ask one last thing?
[16:06:31] <elukey>	 I am wondering what uses the parquet/avro extensions in Druid (what jobs to test etc..)
[16:06:42] <elukey>	 if you are gone np :)
[16:07:39] <joal>	 hm elukey - I can't recall :(
[16:08:05] <joal>	 elukey: we don't use avro IIRC, only parquet, and only for a small number of jobs
[16:08:38] <elukey>	 joal: ack will look for them, thanks!
[16:08:51] <joal>	 elukey: trying to find one :)
[16:09:03] <elukey>	 nono please you are free to go :)
[16:09:20] <joal>	 elukey: medawiki-history-reduced
[16:09:27] <joal>	 not nice to use as test :(
[16:09:45] <elukey>	 ahhahaha
[16:10:27] <joal>	 elukey: easiest to test would be to use the webrequest table (stored in parquet) and devise a dedicated json load
[16:11:28] <joal>	 elukey: you can take my example and add the parquet loading bit ('"inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat"' in inputSpec, and change the path
[16:11:41] <joal>	 I think it's the easiest
[16:11:48] <joal>	 elukey: --^
[16:12:26] <elukey>	 +1 thanks!
[16:12:44] <joal>	 np elukey :)
[16:12:58] <joal>	 In any case you can find me on personal qccount in case of emergency :)
[16:13:04] <joal>	 Bye!
[16:13:05] <elukey>	 o/
[16:23:48] * elukey brb
[16:47:34] <wikibugs>	 (03CR) 10Nuria: [C: 03+2] Remove Cassandra bundle since not used [analytics/refinery] - 10https://gerrit.wikimedia.org/r/615499 (https://phabricator.wikimedia.org/T248289) (owner: 10Mforns)
[17:01:19] <wikibugs>	 (03CR) 10Nuria: [V: 03+2 C: 03+2] Remove Cassandra bundle since not used [analytics/refinery] - 10https://gerrit.wikimedia.org/r/615499 (https://phabricator.wikimedia.org/T248289) (owner: 10Mforns)
[17:20:54] <wikibugs>	 10Analytics, 10Research, 10Research-collaborations: Performance Issues when running Spark/Hive jobs via Jupyter Notebooks - https://phabricator.wikimedia.org/T258612 (10Ottomata)
[17:28:53] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Technical contributors emerging communities metric definition, thick data - https://phabricator.wikimedia.org/T250284 (10Nuria) We were thinking of plotting several wikis on same graph with relative times so we can see  editors/articles content namepsace...
[17:30:45] * elukey off!
[17:33:11] <gehel>	 ottomata, joal: I'm having a look at duplicated dependencies in refinery. I'm surprised this thing works in a semi predictive way.
[17:35:08] <ottomata>	 hahah
[17:35:24] <ottomata>	 gehel: i usually close my eyes when i see those warnings
[17:35:50] <gehel>	 I have no idea what are the classes actually used in production :)
[17:35:58] <gehel>	 and I doubt that anyone can
[17:36:06] <gehel>	 and it's probably not entirely predictable
[17:55:10] <wikibugs>	 (03PS2) 10Gehel: Use properties to configure compiler source and target versions. [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615485
[17:56:07] <nuria>	 gehel: ayayaya
[17:57:50] <nuria>	 gehel:  is the suggestion we specify versions for every one transitive dep?  (asking honestly)
[17:58:09] <wikibugs>	 (03PS9) 10Mforns: Configure Oozie job for loading geoeditors data into Cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/582638 (https://phabricator.wikimedia.org/T248289) (owner: 10Lex Nasser)
[18:11:48] <wikibugs>	 10Analytics, 10Analytics-Kanban: Update PageviewDefinition to only include /api/rest_v1/page/mobile-html requests with X-Analytics: pageview=1 in pageviews - https://phabricator.wikimedia.org/T257860 (10Milimetric) a:03Nuria
[18:51:57] <ottomata>	 AH Ha i figured out why my computer is so slow
[18:52:07] <ottomata>	 either it or the smart hub i use to connect to my monitor IS TOO HOT
[18:52:16] <ottomata>	 it is TOO HOT in my apartment to run my laptop witih a monitor
[18:52:29] <ottomata>	 been putting ice packs on things and thigns are better now
[18:52:55] <ottomata>	 sheesh that took my 1.5 hours to figure out
[20:52:54] <bd808>	 https://meta.wikimedia.org/wiki/Tech#Weird_increase_in_pageviews_about_Wikimedia_projects
[20:53:18] * bd808 can't remember why he has that page watchlisted
[21:06:47] <nuria>	 ottomata: that is  amazing in a bad short of way
[21:06:53] <nuria>	 bd808: looking
[21:09:38] <nuria>	 bd808: probably bot traffic but in too low numbers to be amrked as such
[21:09:42] <nuria>	 *marked
[21:11:47] <bd808>	 ack. the order of magnitude jump in en.wikibooks [[Main Page]] caught my eye
[21:12:03] <nuria>	 bd808: let me look at one specifically
[21:12:46] <nuria>	 bd808: mmm, ya, on that one the jump IS SIGNIFICANt
[21:12:57] <nuria>	 bd808: one sec , doing magic
[21:13:31] * bd808 distracts the audience from nuria as she conjures truth from a pile of numbers
[21:14:09] <nuria>	 bd808: a KEY task 
[21:49:22] <nuria>	 bd808: https://usercontent.irccloud-cdn.com/file/XOFVQVPK/Screen%20Shot%202020-07-22%20at%202.48.44%20PM.png
[21:50:38] <nuria>	 bd808: looks like all pageviews are in desktop from russia/india and linux which screams bot , now at a rate at which traffic is not flagged as such
[21:51:04] <nuria>	 bd808: have handy notebook in stat1007 that you can use if you ever want to do THE MAGIC
[21:55:43] <wikibugs>	 (03CR) 10Nuria: "Can we add this convention and best norms around using it to README? (https://github.com/wikimedia/analytics-refinery-source)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/615481 (owner: 10Gehel)
[22:07:04] <wikibugs>	 10Analytics, 10Performance-Team: Validation rules on eventgate should take max int values into account in order to validate data for an schema - https://phabricator.wikimedia.org/T258659 (10Nuria)