[06:47:54] Good morning [07:53:13] morning! [07:53:20] late start for me this morning :) [07:53:32] :) [07:54:56] I'm still in fight elukey - Obviously when all was good yesterday, I found the hidden plague :S [07:55:00] so joal another person from imply.io answered in druid-users, but he asked if we disable the datasource first etc.. linking the tutorial that you followed :) [07:55:11] :) [07:55:19] joal: with mediawiki-history? [07:55:23] yessir [07:57:52] ah snap [07:58:14] elukey: I'm happy to have found it before deploy, but still, not cool :S [08:00:56] elukey: I'll ask for special authorization from my beloved ops to deploy on a friday [08:01:20] sure makes sense! [08:04:40] (03PS2) 10Joal: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) [08:22:25] just created https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration#Restart_Hue [09:14:56] https://github.com/EDS-APHP/py-hdfs-mount [09:15:18] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (10elukey) [10:15:49] I didn't know french-hospitals were into big data ;) [10:21:09] these french people, always obsessed with data! [10:21:10] :P [10:21:17] hihi [10:21:50] just moved two hdfs crons on an-master1002 to timers [10:22:03] \o/ [10:22:04] (fetch fsimage + prune old images) [10:22:11] MOAR timers [10:24:56] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (10elukey) [10:42:19] PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [10:42:43] this is no bueno [10:43:20] (03PS3) 10Joal: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) [10:46:01] https://stream.wikimedia.org/v2/stream/recentchange returns 502 from nginx [10:46:44] aouch :S [10:46:56] elukey: Can I do anything to help? [10:47:13] trying to check where the problem lies, the backend looks good [10:47:56] elukey: have you had alarms? I don't see any [10:48:27] 12:42 PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: [10:48:30] No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 [10:48:33] seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [10:48:36] * joal looks just a bit ahead and say sorry [10:48:36] joal: --^ [10:48:40] ah okok :) [11:43:11] RECOVERY - Check if active EventStreams endpoint is delivering messages. on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [11:45:36] super weird [11:49:58] 10Analytics, 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) [11:50:05] joal: --^ [11:50:08] super weird [11:50:38] so this morning I noticed that eventstreams in codfw didn't work [11:50:49] (it is active/active, so we serve from both eqiad and codfw) [11:51:05] so I rolled restart eventstreams on scb2* nodes [11:51:07] (codfw ones) [11:51:17] then a slow creep up for the eqiad nodes happened [11:51:21] not sure why [11:51:26] up to my restart [11:51:33] and now things are stable [11:51:48] this might be due to varnish throttling, but I don't have more ideas/proofs [11:51:51] :S [11:53:37] going to lunch! [12:23:11] hey milimetric - would you by any chance be nearby? [12:44:19] (03PS4) 10Joal: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) [12:45:47] 10Analytics, 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Samwalton9) Looks good to me, thanks! [13:04:27] joal: hey [13:04:31] all yours [13:04:32] Hi milimetric :) [13:04:38] batcave? [13:04:41] yes omw [13:08:15] milimetric: will you review my jsonschema-tools code? you can also try it out [13:08:36] easiest to review here https://github.com/wikimedia/jsonschema-tools/pull/1 [13:13:13] ottomata: o/ [13:13:18] o/ [13:13:43] did you see my ping in _security? [13:14:04] we are discussing https://phabricator.wikimedia.org/T226808 [13:14:12] if you have a minute can you join? [13:31:26] 10Analytics, 10EventBus, 10Operations, 10Core Platform Team Backlog (Watching / External), and 2 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10Ottomata) Hm @herron, today we experienced {T226808}, which I think is... [13:38:55] My Hive query fails because of running out of memory on notebook1004. Can anyone recommend anything I can do so it doesn't or should I change the query from looking at a month to looking at a day? https://gist.github.com/bearloga/e37a0d985be2c0c6ff52be84cf96a38c [13:39:59] bearloga which kernel are you using? [13:55:30] ottomata: kernel? [13:57:26] using hive, not beeline or spark (if that's what you're asking?) [14:03:45] ottomata: oh duh jupyter kernel. I'm using the R one [14:04:35] but our R package queries hive through the `hive` command [14:06:14] sorry ottomata going to focus on the mediawiki history stuff a bit [14:09:09] milimetric: no hurry on that at all! 
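The Icinga check that fired and recovered above boils down to: connect to https://stream.wikimedia.org/v2/stream/recentchange and see whether at least one Server-Sent Event arrives within 10 seconds. A minimal sketch of that kind of probe, using only the requests library; this is not the actual Icinga plugin, and the event handling is illustrative:

    import json
    import requests

    STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
    TIMEOUT = 10  # seconds with no data before we call the stream broken

    def stream_delivers_message(url=STREAM_URL, timeout=TIMEOUT):
        """Return True if at least one SSE data payload arrives within `timeout`."""
        try:
            with requests.get(url, headers={"Accept": "text/event-stream"},
                              stream=True, timeout=timeout) as resp:
                resp.raise_for_status()
                for line in resp.iter_lines(decode_unicode=True):
                    if line and line.startswith("data:"):
                        json.loads(line[len("data:"):].strip())  # payloads are JSON events
                        return True
        except (requests.RequestException, ValueError):
            return False
        return False

    if __name__ == "__main__":
        print("OK" if stream_delivers_message() else "CRITICAL")

A 502 from nginx, as seen during the outage, would surface here as raise_for_status() failing, i.e. CRITICAL.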
[14:10:26] bearloga: try [14:10:26] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Hadoop_containers_run_out_of_memory [14:10:41] althhough i think more thhan 4GB might be needed for yours [14:10:57] i think the container limits have been increased to 4G by default anyuway [14:12:12] I am wondering if we should start advertising the fact that when kerberos will be enabled, beeline/hive-server-2 will need to be used (since they will be the only ones allowing auth) [14:15:19] 10Analytics, 10User-Elukey: Show IPs matching a list of IP subnets in Webrequest data - https://phabricator.wikimedia.org/T220639 (10faidon) 05Resolved→03Open So, a few things: - There is a better source for this kind of data, that is updated hourly rather than monthly: https://as286.net/data/ana-invalids.... [14:16:54] PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [14:19:48] elukey: +1 [14:22:31] ottomata: thank you! that fixed it [14:31:49] 10Analytics, 10Analytics-Kanban: Fix Hive partition thresholding in refinery-drop-older-than - https://phabricator.wikimedia.org/T226835 (10mforns) [14:36:02] (03PS1) 10Mforns: Fix daily and monthly partition thresholding [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) [14:38:43] (03CR) 10Mforns: [V: 03+2] Fix daily and monthly partition thresholding [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [14:57:10] (03CR) 10Ottomata: [C: 03+1] "Nice thank you!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [15:10:42] (03CR) 10Milimetric: "Since these comments are all about comments, I'll make the changes and send the patch." (037 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [15:12:55] I am checking my spark job via https://yarn.wikimedia.org/proxy/application_1561367702623_13494/ [15:13:06] but it seems stuck.. is it waiting for something? [15:13:16] resources seem to be available in the default queue [15:13:39] I changed the code of the rpki stuff, it should be working but i don't see any error in the logs [15:14:41] in fairness I requested 656384 MB of ram, that might be too much [15:16:51] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Nuria) The expiration for objects can be specified at the time of upload, so it needs to be added to our current wo... [15:20:46] elukey: looks like it was killed? [15:20:52] ohh by user. [15:20:55] (03PS5) 10Milimetric: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [15:20:57] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) @EBernhardson @Ottomata re: swift expiring objects, see the link above too and tl;dr is: The X-Delete-... 
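The out-of-memory fix ottomata pointed bearloga at, a little further up, amounts to asking YARN for bigger map/reduce containers for that one query instead of shrinking it from a month to a day. Since the R helper in question shells out to the hive CLI, here is a sketch of passing the same knobs from Python; the property names are the standard MapReduce ones the linked wikitech section deals with, but the values and the example query are placeholders, not a recommendation:

    import subprocess

    # Request larger YARN containers for this query only; heap (-Xmx) is kept a
    # bit below the container size. Values are illustrative.
    HQL = """
    SET mapreduce.map.memory.mb=4096;
    SET mapreduce.map.java.opts=-Xmx3686m;
    SET mapreduce.reduce.memory.mb=8192;
    SET mapreduce.reduce.java.opts=-Xmx7372m;
    SELECT COUNT(1) FROM wmf.webrequest
    WHERE year=2019 AND month=6 AND day=28 AND hour=0;
    """

    def run_hive(hql: str) -> str:
        """Run an HQL script through the hive CLI and return its stdout."""
        result = subprocess.run(["hive", "-S", "-e", hql],
                                capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        print(run_hive(HQL))

As elukey notes above, once Kerberos is enabled the same statements would go through beeline/HiveServer2 rather than the hive CLI.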
[15:22:24] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) Great! @fgiunchedi you said 'that is something we'd have to deploy first'. Can I use this now? [15:26:32] ottomata: yeah I did it, re-launched with less workers but not much different.. https://yarn.wikimedia.org/cluster/app/application_1561367702623_13562 [15:26:42] so I am pretty sure I am failing to do something [15:27:26] elukey: it looks like it is doing stuff... at least it has active executors [15:28:02] so those are either super slow or doing something not good right? [15:28:10] i guess? [15:28:12] is there a way to see what they are doing? [15:28:22] https://yarn.wikimedia.org/proxy/application_1561367702623_13562/stages/ [15:28:27] 128 running [15:28:28] 0 done. [15:28:41] maybe the showString is your problem? [15:28:45] where is your .py file? [15:28:47] can I look? [15:28:59] maybe you should write to a file rather than stdout? [15:29:05] (just guessing here) [15:29:23] sure, it is in /home/elukey/rpki/rpki_invalid_prefix_finder.py on stat1004 [15:29:51] I am trying to change the UDF that joseph came up with after a chat with Faidon, but probably now the UDF is wrong [15:30:43] (03CR) 10Nuria: Fix daily and monthly partition thresholding (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [15:30:59] elukey: yeah, i think maybe the .show(10000, False) is theh problem? [15:31:07] what if you do instead [15:31:59] i think [15:32:15] ottomata: it worked before I changed the UDF, not sure if it is the show() [15:32:21] .write.options("delimeter", "\t").csv("/user/elukey/path/to/output") [15:32:21] ? [15:32:25] nuria: mforns milimetric hellooo before we freeze wikistats ui development, I think this is the one patch out of the unmerged 4 that really needs to be merged and deployed [15:32:25] https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/519016/ [15:32:30] that might be old though.. [15:32:53] hmm yeah maybe not? [15:33:00] what did you change? [15:33:16] (sorry meeting) [15:44:16] fdans, lookin! [15:48:43] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) >>! In T213976#5292498, @Ottomata wrote: > Great! @fgiunchedi you said 'that is something we'd have to... [15:49:23] (03CR) 10Mforns: [V: 03+2] Fix daily and monthly partition thresholding (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [15:49:29] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) Oh ok, will do! [15:57:28] hey milimetric - pre-standup? [15:57:47] omw [16:01:05] ping ottomata [16:01:15] ping fdans [16:01:43] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10mmodell) >>! In T213976#4968603, @Ottomata wrote: > @mmodell This is kind of a 'deployment' process thing, is this... 
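On the Swift thread quoted above: the expiration nuria and fgiunchedi mention is Swift's own object-expiry feature, driven by an X-Delete-After header (TTL in seconds) or X-Delete-At (absolute unix timestamp) supplied at upload time, and it only takes effect where the object expirer is deployed, which is what the ticket is sorting out. A sketch of such an upload with plain requests; the storage URL, token, container and object names are all hypothetical:

    import requests

    def upload_with_expiry(storage_url, token, container, name, data, ttl_seconds):
        """PUT an object into Swift and ask Swift to delete it after ttl_seconds."""
        resp = requests.put(
            f"{storage_url}/{container}/{name}",
            data=data,
            headers={
                "X-Auth-Token": token,
                # Swift's object expirer removes the object once the TTL elapses;
                # X-Delete-At with a unix timestamp is the absolute-time variant.
                "X-Delete-After": str(ttl_seconds),
            },
        )
        resp.raise_for_status()
        return resp.headers.get("Etag")

    # e.g. keep a computed dataset around for 30 days (names are made up):
    # upload_with_expiry(storage_url, token, "elasticsearch",
    #                    "popularity-scores-2019-06-28.csv",
    #                    open("scores.csv", "rb"), 30 * 24 * 3600)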
[16:01:45] (03PS6) 10Milimetric: Fix mediawiki-history-page create event [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [16:01:50] 10Analytics, 10Discovery, 10Operations, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10mmodell) >>! In T213976#4995886, @Ladsgroup wrote: > Yes, you're right. Maybe turning mwmaint1002 to a minikube and... [16:02:33] OH uh [16:13:09] Nettrom: Good morning -- I apologize I'm late on deploy [16:13:38] Nettrom: I will deploy later on today (even if Frida, I have special permisions from my dear ops- [16:18:15] RECOVERY - Check if active EventStreams endpoint is delivering messages. on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams [16:19:25] (03CR) 10Joal: [C: 03+2] Fix daily and monthly partition thresholding [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519643 (https://phabricator.wikimedia.org/T226835) (owner: 10Mforns) [16:19:42] thanks joal :D [16:19:45] ;) [16:26:14] (03PS2) 10Joal: Update mediawiki-history for page-history refactor [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519349 (https://phabricator.wikimedia.org/T221825) [16:29:04] joal: now I'm worried that you're reading my mind, as I was about to ask about the deployment ;) sorry you're having to do it on a Friday, I appreciate you working at this hour [16:30:02] Nettrom: I try not to read minds, I have too many problems with mine alone - I'll keep you posted on deploy happenning :) [16:35:45] milimetric: shall I merge the patch and start deployment? [16:35:50] Arf [16:35:52] sorry [16:35:53] :) [16:37:07] 10Analytics, 10Analytics-Kanban: Move reportupdater queries from limn-* repositories to reportupdater-queries - https://phabricator.wikimedia.org/T222739 (10Nuria) See code change: https://gerrit.wikimedia.org/r/#/c/analytics/reportupdater-queries/+/517084/ (merged on other task) [16:37:11] (03CR) 10Joal: [V: 03+2 C: 03+2] "Let's merge!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [16:37:29] ottomata: anything to merge before I deploy? [16:37:31] 10Analytics, 10Analytics-Kanban: Move reportupdater queries from limn-* repositories to reportupdater-queries - https://phabricator.wikimedia.org/T222739 (10Nuria) Puppet code : https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/517085/ [16:38:28] joal: deploy? not for me [16:38:32] it is a friday tho?! deploY!? [16:38:33] :p [16:38:50] ottomata: I got a special perm from elukey - but I know I'm doing wrong [16:39:51] 10Analytics, 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) EventStreams is hitting its concurrent connection limits of about 200 connections. We think this is probably due to a single cl... [16:39:58] :) [16:43:55] (03PS1) 10Joal: Remove incorrect comment and fix typo in changelog [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519666 [16:44:16] ottomata: what swift container for test uploads? [16:44:28] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy." 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519666 (owner: 10Joal) [16:44:44] ebernhardson: up to you mostly, what do you think would be best? [16:44:47] 'elasticsearch'? [16:45:07] ottomata: so it doesn't really matter, just a tag of sorts? [16:45:13] ya [16:45:23] gotcha [16:48:49] so the /mnt/hdfs mountpoint works with kerberos! it just fails if you don't have a ticket with input/output errors [16:50:21] nice [16:54:48] elukey: we can hack config on scb1001 to enable trace logging with X-Client-IP, restart eventstreams [16:54:51] and see [16:54:53] shall we? [16:55:12] yar i have a meeting for the next 30 mins tho... [16:57:49] ottomata: not sure if X-Client-IP is sent by varnish to the backend, but if so yes! [16:57:56] it should be! [16:58:25] ok will do this meeting then we try [16:59:13] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata) [17:02:45] ottomata: or we could use tcpdump :) [17:08:06] (03CR) 10Milimetric: [C: 03+2] Update mediawiki-history for page-history refactor [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519349 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [17:08:08] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Update mediawiki-history for page-history refactor [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519349 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [17:08:34] (03CR) 10Joal: [V: 03+2] Bump jar version for oozie webrequest load bundle [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519506 (https://phabricator.wikimedia.org/T225792) (owner: 10Joal) [17:11:54] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:12:21] !log Refinery-source v0.0.93 released to archiva [17:12:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:12:38] !log Deploying refinery with scap [17:12:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:22:29] 10Analytics, 10Operations, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Pchelolo) [17:28:23] elukey: know of a good way to be sure to catcpure the X-Client-IP Header? [17:28:34] with tcpdump? [17:28:36] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:29:36] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:30:57] acutally maybe sudo tcpdump -A port 8092 | grep X-Client-IP is enought! 
[17:31:05] ottomata: in theory we could capture everything with -A and then use wireshark to follow the HTTP flow, lemme try [17:31:11] not sure if you'll see all the headers [17:31:35] 10Analytics, 10Analytics-Kanban, 10Operations, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:31:38] althoughh i'm not sure if i'm seeing that every time [17:31:48] elukey: hacking the config there is pretty easy [17:31:54] we could do it officially via scap if we wanted to [17:32:00] but it is just adding a couple of lines in the config [17:32:02] sure we can do it manually [17:33:40] ok elukey [17:33:40] hacked [17:33:44] will depool, restart, pool [17:33:53] (03CR) 10Mforns: [C: 03+2] Create "all" time ranges based on the metric config [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/519016 (https://phabricator.wikimedia.org/T226486) (owner: 10Fdans) [17:33:54] sudo journalctl -f -u eventstreams [17:33:56] to follow :) [17:34:01] "Unexpected error while saving file: database or disk is full" uhh everything alright with notebook1004? [17:34:13] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata) [17:34:24] darn, bearloga beat me to reporting that [17:34:28] aaah! clicked the +2 button by mistake again! :[ [17:34:37] bearloga: checking, did any of you by any chance create a lot of files? [17:34:40] sighh probably not! who's using theh space again?!?! [17:34:53] /dev/mapper/notebook1004--vg-data 136G 129G 0 100% /srv [17:35:05] we need quotas [17:35:29] I am checking homes [17:35:50] 47G nathante [17:35:57] yep [17:36:06] he is not online though [17:36:25] elukey, is there a way to interrupt a gate and submit job? [17:36:40] mforns: not sure what you mean [17:36:45] elukey, can you cancel the wikistats one: https://integration.wikimedia.org/zuul/ [17:36:50] ahhhh [17:36:59] I clicked on +2 accidentally, [17:37:17] nothing super critical, but just wanted to avoid the revert... [17:37:24] if you revert your +2, shoudn't it be ok? [17:37:34] the CI should only add V+2 IIRC [17:38:22] elukey, ahaahh, but the description of the job in zuul said it would merge it [17:38:26] it didn't though [17:38:31] super :) [17:39:08] thanks elukey, I think gerrit +2 button should not be blue! [17:39:18] ahahahah [17:39:25] should be gray like all the normal buttons! [17:41:41] Just DMed Nate on Twitter [17:42:52] fixed it bearloga [17:43:01] there was a trash directory taking ~36G [17:43:38] !log deleted /srv/home/nathante/.local/share/Trash/* to free space on notebook1004 [17:43:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:44:19] hi [17:45:57] groceryheist: hello :) [17:46:10] there was a big trash directory in /srv/home/nathante/.local/share/Trash/ [17:46:20] that was causing a lot of disk usage on notebook1003 [17:46:22] err 1004 [17:46:26] now gone, so all good :) [17:48:37] ok restarted and repooled, these logs go to /srv/log/eventstreams/main.log [17:48:51] Is that trash dir only for files that get deleted through the jupyter interface? rm doesn't do that, right? [17:49:19] although I don't see any for /v2/streams... 
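Since the hacked-in trace logging on scb1001 is not showing the /v2/stream requests yet, the tcpdump fallback discussed just above (capture the plaintext backend traffic on port 8092 and grep for X-Client-IP) can be turned into a quick per-client tally, which is essentially the data later posted to T226808. A sketch that wraps the same tcpdump invocation; it assumes the nginx-to-service hop really is unencrypted HTTP on port 8092, as the command in the log implies, and it needs root:

    import collections
    import re
    import subprocess

    CLIENT_IP_RE = re.compile(r"X-Client-IP:\s*(\S+)", re.IGNORECASE)

    def tally_client_ips(port=8092, packets=5000):
        """Capture a burst of packets and count X-Client-IP headers per client."""
        # -A prints payloads as ASCII, -l line-buffers, -n skips DNS lookups,
        # -c stops after `packets` packets so the capture ends on its own.
        proc = subprocess.run(
            ["tcpdump", "-A", "-l", "-n", "-c", str(packets), "port", str(port)],
            capture_output=True, text=True, errors="replace",
        )
        return collections.Counter(CLIENT_IP_RE.findall(proc.stdout)).most_common(10)

    if __name__ == "__main__":
        for ip, hits in tally_client_ips():
            print(f"{hits:6d}  {ip}")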
[17:49:27] yeah I wasn't really aware this was a thing that happens [17:49:37] bearloga: yes I think so, rm brutally drops data :) [17:52:10] ottomata: did you find anything useful on scb1001? [17:54:13] elukey: its still logging but not yet? i do'nt see any incoming stream requets... [17:54:56] yar, hang on gotta go afk for a few mins.... [18:00:48] (03CR) 10Mforns: "Sorry for the mess with +2 and jenkins." [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/519016 (https://phabricator.wikimedia.org/T226486) (owner: 10Fdans) [18:01:23] elukey: please excuwe me, I'd need some help with scap (I assume size related) [18:02:16] joal: sure! [18:02:25] notebooks complaining? [18:02:40] elukey: 2 targets failed, don't know which [18:03:52] yep notebook1003 100% usage [18:03:57] indeed elukey - notebook1003 and notebook1004 [18:04:01] :s [18:04:06] shall I rollback? [18:04:08] joal: I really think that we should force scap to keep only one revision [18:04:11] no no [18:04:14] lemme fix it [18:04:25] I say no to rollback then [18:04:29] ok? [18:04:50] elukey: We could keep more than 1, maybe like 5 - but that would already be a lot less [18:05:32] joal: no no more than 1 is exactly the problem, those revs are huge [18:05:35] we have 2 now [18:05:49] so I cleaned up notebook1003 [18:05:51] Ahhhh - you're talking about jars !!!! [18:06:04] well the whole thing [18:06:04] 1-rev whu not, and 5 jars no more :) [18:06:11] ahhh yes 5 jars [18:06:13] yes yes [18:06:17] ok I got you :) [18:06:21] same for me :) [18:06:36] ok - so saying 'no' to scap about rollback ? [18:06:39] elukey: --^ [18:06:41] yep yep [18:06:51] ok- I'm afraid when doing so :) [18:06:56] you can try to deploy with limit to 1003 [18:07:03] 1004 need to check why it failed [18:07:15] I think it's space related as well [18:07:39] ah interesting [18:07:40] Using deprecated git_fat config, swap to git_binary_manager [18:07:47] not the error but worth to follow up [18:08:35] elukey: going for notebook1003 [18:08:50] went so fast that I ssumed it actually failed (no jar downloaded) [18:08:52] joal: try also 1004, srv partition is 73% full [18:08:54] so a lot of space [18:08:59] ok [18:09:13] Same thing [18:09:24] where/how would we trim the jars that we keep? (so I can try to follow up next week) [18:09:25] super fast, no jar download [18:09:38] maybe you can --force [18:10:01] elukey: It means checking everywhere (oozie, timers, cron) about version and if we can safely bump [18:10:19] ah so manually [18:10:45] elukey: I think we need to rollback the last rev on notebooks, to deploy again (jars corrupted) [18:11:05] joal: iirc if you --force you should create a new rev no? [18:11:25] Ah, I don;t know [18:11:27] testing [18:11:58] seems correct indeed elukey ! many thanks :) [18:12:25] \o/ [18:13:46] ok full deploiy done with scap - deploying on hdfs [18:13:59] groceryheist Nettrom elukey: so I looked everywhere in Jupyter Notebook & Lab interfaces and there's nothing about emptying the trash. [18:14:05] Nettrom: Your whitelist patch has been deployed, you should see changes soon :) [18:14:29] joal: going afk a bit for dinner, will bbl! [18:14:39] bye elukey - Thanks agai [18:14:41] joal: wonderful, thanks for making that happen! [18:15:04] so not sure what we can do going forward except…uh…remember to SSH into notebook100X every now and then and check it ourselves??? 
[18:15:31] !log Deploy refinery to HDSF [18:15:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:15:47] bearloga: setup a cron job to remove big files from the trash? [18:17:23] groceryheist: oooooh I like that idea. hm… ottomata: is that something that can be enabled for all users automatically? (a cron job to remove big trashed files, since there's no way to do it through Jupyter's UI) [18:24:32] Arf- forgot to update jar version in data-quality bundle prop [18:25:43] (03PS1) 10Joal: Bump data-quality oozie bundle jar to v0.0.93 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519678 [18:25:52] mforns: --^ [18:29:10] phew ok back [18:30:55] ok collected a buncha logs [18:31:04] resetting eventstreams logging back to normal [18:31:46] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519678 (owner: 10Joal) [18:32:34] ottomata: I need to dpeloy again (refinery with scap), had forgotten a jar-bump - Do you mind being reachable, I already ran intio issues earlier on becasue of space :( [18:32:43] joal: am here now ya [18:33:27] Thanks [18:33:46] !log Deploy refinery with scap [18:33:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:34:22] elukey: indeed a see a loot from a single IP [18:35:06] lot* [18:35:15] proceeding with the ip throttling idea [18:35:34] 10Analytics, 10Operations, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) Collected some info about which IPs were connecting on scb1001. Over a period of about 40 minutes: 3 "100.26.... [18:40:17] (03PS1) 10Zhuyifei1999: queryrun.py: Default extra_info to {} [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/519679 [18:41:15] ottomata: what is the idea? code in es? [18:42:00] bearloga: about the Trash, there must be a way to autoclean it, let's try to find it first :) [18:42:15] (maybe a setting in jupyterhub) [18:44:25] (03CR) 10Zhuyifei1999: [C: 03+2] queryrun.py: Default extra_info to {} [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/519679 (owner: 10Zhuyifei1999) [18:44:35] ottomata: I am logging off now, but please ping me if you need me [18:44:48] (03Merged) 10jenkins-bot: queryrun.py: Default extra_info to {} [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/519679 (owner: 10Zhuyifei1999) [18:44:53] ottomata: i can be rubberduck in elukey 's absence [18:46:09] joal, was in meeting, cool! thanks for deploy! [18:46:18] np, redeploing [18:50:21] ottomata: error on analytics1030.eqiad.wmnet [18:50:23] :( [18:50:28] I assume disk full [18:51:01] ottomata: I now know how to redeploy (limit and force), but I'd need you to remove some revs first please [18:52:14] ! 
Deploying refinery to HDFS [18:52:46] !log Kill webrequest bundle [18:52:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:53:16] !log Kill data-quality-hourly bundle [18:53:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:54:34] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list as needed for iOS - https://phabricator.wikimedia.org/T226849 (10kzimmerman) [18:55:17] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Web - https://phabricator.wikimedia.org/T226850 (10kzimmerman) [18:55:58] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Android - https://phabricator.wikimedia.org/T226852 (10kzimmerman) [18:56:43] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for AHT - https://phabricator.wikimedia.org/T226853 (10kzimmerman) [18:57:06] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Growth - https://phabricator.wikimedia.org/T226854 (10kzimmerman) [18:57:48] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Editing - https://phabricator.wikimedia.org/T226855 (10kzimmerman) [18:58:03] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Language - https://phabricator.wikimedia.org/T226856 (10kzimmerman) [18:59:25] !log Restart Webrequest bundle [18:59:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:01:07] sorry joal just sawa ping [19:01:15] looking [19:01:20] thanks ottomata :) [19:02:06] mforns: !log restart data-quality-hourly bundle [19:02:35] joal, :] [19:02:51] Ok - Taking a break, will monitor jobs in a bit, and also finalize deploy for MWH now that other jobs are done [19:02:51] ok joal try now [19:02:55] sure ottomata [19:08:07] Worked ottomata - many thanks :) [19:11:51] great [19:15:15] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech - https://phabricator.wikimedia.org/T226861 (10nettrom_WMF) [19:16:21] 10Analytics, 10Community-Tech, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech - https://phabricator.wikimedia.org/T226861 (10nettrom_WMF) [19:16:52] 10Analytics, 10Analytics-Kanban: Make timers that delete data use the new deletion script - https://phabricator.wikimedia.org/T226862 (10mforns) [19:17:14] 10Analytics, 10Anti-Harassment, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for AHT - https://phabricator.wikimedia.org/T226853 (10nettrom_WMF) [19:18:25] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Growth - https://phabricator.wikimedia.org/T226854 (10nettrom_WMF) [19:20:44] cc Nettrom deployment happen today [19:20:53] due to some last minute work [19:27:00] nuria: thanks! joal kept me in the loop on that today. 
Checked the Data Lake now and can confirm that data is flowing in and appears to be correct [19:27:43] time for lunch [19:53:43] a-team, finished what I wanted to do, going for vacation, see you in a couple weeks! have fun :] [19:53:51] byyeeee marcel! [19:53:55] mforns: have an awesome time! [19:54:05] o/ mforns [19:54:06] See you Marcel, enjoy your time off ! [19:54:37] thank youuuu! [19:54:48] joal, have a nice time too :D [19:54:54] I will :) [19:55:24] :] [20:04:32] !log drop-recreate mediawiki_history, mediawiki_page_history and mediawiki_user_history tables in hive [20:04:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:09:26] groceryheist: btw, since you work with readingDepth data this new metric might be of interest [20:09:55] groceryheist: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/SessionLength [20:47:09] (03CR) 10Milimetric: Fix mediawiki-history-page create event (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [20:57:51] !log Restart mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-reduced-coord [20:57:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:59:02] 10Analytics, 10Operations, 10Patch-For-Review, 10Security, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) [21:01:54] (03PS1) 10Joal: Fix typo in mediawiki_page_history table creation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519719 [21:03:19] (03CR) 10Joal: [V: 03+2 C: 03+2] Fix mediawiki-history-page create event (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/519521 (https://phabricator.wikimedia.org/T221825) (owner: 10Joal) [21:04:12] (03PS2) 10Joal: Fix typo in mediawiki_page_history table creation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519719 [21:04:30] (03CR) 10Joal: [V: 03+2 C: 03+2] "typo, merging" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/519719 (owner: 10Joal) [21:05:29] ok done for tonight - prod is almost ready to welcome the new snapshot, only missing a checker-coherent of previous version ,which is currently being computed [21:22:53] 10Analytics, 10Operations, 10Patch-For-Review, 10Security, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) To hold us over on the weekend, I've manually blacklisted the offending IP in Eve... [21:23:46] gotta run! have a good weekend all! [23:03:08] 10Analytics, 10Operations, 10Patch-For-Review, 10Security, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Nuria) {F29666420} Well, blocking that one IP had the effect of lowering connections. Give... [23:05:58] thanks nuria! [23:06:28] that's pretty nice! [23:07:03] groceryheist: we hope to have that data by q2, that is probably oct or so [23:07:20] nuria: I'd like to work on helping create pathways to release this public versions of kind of data [23:07:51] cool [23:08:03] would you like some feedback on the design? [23:18:40] groceryheist: totally, [23:19:02] groceryheist: you can do it wiki style (talk page) or the channel
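The graph nuria references shows why the single blacklisted IP mattered: one client was consuming most of the roughly 200 concurrent connections EventStreams allows in total, and the follow-up discussed on T226808 is capping connections per client rather than only globally. EventStreams itself is a Node.js service, so the following is only a language-agnostic sketch of that idea in Python, keyed on the X-Client-IP header surfaced by the earlier trace logging; the per-client limit of 25 is invented for the example:

    import collections
    import threading

    PER_IP_LIMIT = 25  # hypothetical; the service-wide cap in the incident was ~200

    class PerIpLimiter:
        """Track concurrent streams per client IP and refuse new ones over the limit."""

        def __init__(self, limit=PER_IP_LIMIT):
            self.limit = limit
            self.active = collections.Counter()
            self.lock = threading.Lock()

        def try_acquire(self, client_ip):
            with self.lock:
                if self.active[client_ip] >= self.limit:
                    return False  # caller answers 429 and closes the connection
                self.active[client_ip] += 1
                return True

        def release(self, client_ip):
            with self.lock:
                self.active[client_ip] -= 1
                if self.active[client_ip] <= 0:
                    del self.active[client_ip]

    # In a request handler: ip = headers.get("X-Client-IP")
    # if not limiter.try_acquire(ip): reject; and call limiter.release(ip) on close.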