[00:06:27] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10good first task: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171 (10Nuria) Thank you @paulkernfeld , let us know if you are interested on another task, you can ping us in #wikimedia-analytics on irc   an example of one (i...
[01:05:13] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Technical contributors emerging communities metric definition, thick data - https://phabricator.wikimedia.org/T250284 (10jwang) I have published the report to meta wiki. Feel free to comment. Link: https://meta.wikimedia.org/wiki/Research:Emerging_Technic...
[06:17:31] <wikibugs>	 10Analytics: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10elukey) >>! In T261693#6428636, @Ottomata wrote: > Ya, PCC is good, but Razzi created this ticket hoping that an obvious incorrect type like this could be caught by the Jenkins tests that run with every pa...
[06:18:03] <elukey>	 good morning
[06:24:48] <wikibugs>	 10Analytics: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10razzi) Yes, in my opinion there should be a nonzero exit code of the type checker, and that should propagate to the Jenkins job.
[07:18:55] <elukey>	 very interesting reading https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt
[07:19:15] <elukey>	 we have a recurrent issue about soft cpu locks ups for hadoop workers
[07:20:07] <elukey>	 once every now and then one worker gets into a state that causes alarms (too busy to respond to icinga pings) and the mgmt serial console shows soft lockups in the tty with the login
[07:20:28] <elukey>	 (root login via serial console is not available, too slow even to create a session)
[07:20:58] <elukey>	 in this case, 1059 was reported as "down" from icinga
[07:20:58] <elukey>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=analytics1059&var-datasource=thanos&var-cluster=analytics&from=now-24h&to=now
[07:21:26] <elukey>	 and I had to powercycle the host to restore its functionality
[07:22:59] <elukey>	 (the watchdog is supposed to update a timestamp inside the kernel periodically, if it doesn't do it for 20s then a stall is detected)
[07:24:57] <elukey>	 this might be something that goes away with the migration to Bigtop + buster probably
[07:38:28] <elukey>	 --
[07:38:40] <elukey>	 I am going to reimage jumbo1003 to buster
[07:44:02] <wikibugs>	 10Analytics-Clusters: Upgrade Kafka Brokers to Debian Buster - https://phabricator.wikimedia.org/T255123 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['kafka-jumbo1003.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202009020743_el...
[07:49:56] <elukey>	 so I wanted to also start the roll restart of hadoop for openjdk upgrades, buuut I see that sqoop is running so not a great idea :D
[08:18:16] <wikibugs>	 10Analytics-Clusters, 10Patch-For-Review: Upgrade Kafka Brokers to Debian Buster - https://phabricator.wikimedia.org/T255123 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1003.eqiad.wmnet'] `  and were **ALL** successful.
[08:37:13] <elukey>	 !log run kafka preferred-replica-election on jumbo after jumbo1003's reimage to buster
[08:37:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:37:43] <elukey>	 ok jumbo1004 is the only broker left to upgrade to buster
[08:45:08] <wikibugs>	 10Analytics, 10Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (10elukey) Today I tried to add the ` --header "X-Forwarded-For: XX.XX.XX.XX"` using my external ipv4 address, and I was able to get a short url from an-tool1005. I guess that we could think about us...
[09:04:45] <wikibugs>	 10Analytics-Clusters, 10Patch-For-Review: Upgrade Kafka Brokers to Debian Buster - https://phabricator.wikimedia.org/T255123 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['kafka-jumbo1004.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-r...
[09:38:44] <wikibugs>	 10Analytics-Clusters, 10Patch-For-Review: Upgrade Kafka Brokers to Debian Buster - https://phabricator.wikimedia.org/T255123 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1004.eqiad.wmnet'] `  and were **ALL** successful.
[09:41:47] <elukey>	 all jumbo nodes on busteR!
[09:43:10] <wikibugs>	 10Analytics-Clusters, 10Patch-For-Review: Upgrade Kafka Brokers to Debian Buster - https://phabricator.wikimedia.org/T255123 (10elukey) https://gerrit.wikimedia.org/r/623742 didn't really solve the problem, but I wouldn't spend more time into this since all the Jumbo nodes are on buster now.  Calling this task...
[09:43:34] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade Kafka Brokers to Debian Buster - https://phabricator.wikimedia.org/T255123 (10elukey)
[09:43:53] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade Kafka Brokers to Debian Buster - https://phabricator.wikimedia.org/T255123 (10elukey) p:05Triage→03Medium a:03elukey
[09:44:03] <wikibugs>	 10Analytics, 10Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (10elukey)
[09:45:02] <wikibugs>	 10Analytics, 10Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (10elukey)
[10:09:09] <wikibugs>	 10Analytics, 10Operations: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe)
[10:09:34] <wikibugs>	 10Analytics, 10Operations: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe) p:05Triage→03Unbreak! Setting priority to UBN! given the seriousness of the perf regression.
[10:14:27] <wikibugs>	 10Analytics, 10Operations: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe) It looks like kafka2003 is the culprit - its broker latencies are in the order of 1 seconds.
[10:28:21] <wikibugs>	 10Analytics, 10Operations: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe) So this is probably due to all the purges going through the codfw kafka2003 server, and that we still haven't partitioned the purge topic.  In normal conditions, the pur...
[10:45:16] * elukey lunch!
[11:49:17] <joal>	 \o/ kaf-ster!
[13:24:51] <milimetric>	 nice
[13:24:53] <milimetric>	 heya
[13:31:23] <wikibugs>	 10Analytics, 10Analytics-Kanban: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 (10Milimetric)
[13:31:49] <wikibugs>	 10Analytics, 10Analytics-Kanban: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 (10Milimetric) p:05Triage→03High
[13:34:21] <milimetric>	 elukey / ottomata: yall see the eventgate unbreak now?
[13:34:33] <milimetric>	 https://phabricator.wikimedia.org/T261846
[13:34:47] <elukey>	 we are discussing in #wikimedia-sre :)
[13:35:42] <milimetric>	 ah!  What?!  I'm on wikimedia-operations like a loser and there's a new channel?
[13:36:19] <milimetric>	 k, glad yall on it
[13:36:28] <elukey>	 ahahah nono we follow it as well, it is just too much noise with icigna
[13:36:31] <elukey>	 *icinga
[13:36:36] <elukey>	 so we created a new fancy chan
[13:53:53] <ottomata>	 razzi: klausman there is an occasional Product Analytics + Analytics Engineering sync up meeting happening in 1h, then we have our weekly analytics ops sync meeting in 1.5h (thats just us 4 analytics eng SREs).  
[13:54:01] <ottomata>	 you should both be invited to the analytics eng ops sync
[13:54:13] <ottomata>	 lemme know if you'd like to be invited to the PA + analytics eng one.  
[13:54:32] <ottomata>	 Product Analytics is a team of data scientists and analysts in the product dept. that use our infra
[13:56:48] <elukey>	 so both are in the ops sync meeting invite, but not in the PA one afaics
[13:56:57] <klausman>	 Correct
[13:58:44] <ottomata>	 ya, i guess i'll add yall as optional invites
[13:59:18] <elukey>	 already done ottomata 
[13:59:27] <elukey>	 klausman: can you check?
[13:59:36] <ottomata>	 oh :p
[13:59:43] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Goal, 10Services (watching): Modern Event Platform: Stream Configuration - https://phabricator.wikimedia.org/T205319 (10mpopov)
[13:59:51] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Product-Infrastructure-Data, and 2 others: Streams with empty configs should be rendered as {} in the JSON returned by StreamConfig API - https://phabricator.wikimedia.org/T259917 (10mpopov) 05Open→03Resolved a:05fdans→03Mholloway Thanks, Michael...
[14:01:04] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Goal, 10Services (watching): Modern Event Platform: Stream Configuration - https://phabricator.wikimedia.org/T205319 (10Ottomata) @nuria, I think the work for this can be considered done.  Should we close this parent task?
[14:01:09] <klausman>	 Yup, I see a ⇄ meeting overlapping with the ops sync
[14:06:36] <ottomata>	 yeah, will go to the PA one and then leave early for ops sync
[14:11:50] <klausman>	 I think I'll give it a go, just see what it's like
[14:17:29] <wikibugs>	 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Goal, 10Services (watching): Modern Event Platform: Stream Configuration - https://phabricator.wikimedia.org/T205319 (10Nuria) 05Open→03Resolved
[14:17:31] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Nuria)
[14:29:24] <milimetric>	 razzi: I'm going to deploy today, after standup, do you want to tag along?
[14:49:42] <milimetric>	 ottomata: doesn't this catch the "Any unexpected error" mentioned in the comment just below? https://gerrit.wikimedia.org/g/mediawiki/services/eventstreams/deploy/+/dbc9bbbe7355b844a8c7e4455f0b1e5b5f45053f/node_modules/kafka-sse/lib/KafkaSSE.js#666
[14:50:49] <wmf-insecte>	 Starting build #3 for job wikimedia-event-utilities-maven-release-docker
[14:51:50] <wmf-insecte>	 Project wikimedia-event-utilities-maven-release-docker build #3: 09SUCCESS in 1 min 1 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/3/
[14:55:23] <ottomata>	 milimetric:  i have some stuff i want to merge before deploy!
[14:55:31] <ottomata>	 trying to get it in , but meetings are starting soon!
[14:56:13] <milimetric>	 ottomata: no prob, I can wait
[15:01:10] <wikibugs>	 (03PS7) 10Ottomata: Use EventSchemaLoader and EventLoggingSchemaLoader from org.wikimedia.eventutilities [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/622369 (https://phabricator.wikimedia.org/T251609)
[15:07:35] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Use EventSchemaLoader and EventLoggingSchemaLoader from org.wikimedia.eventutilities [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/622369 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[15:12:48] <ottomata>	 klausman: man sorry your intro to meetings at WMF so far is "how should we have meetings"
[15:12:53] <ottomata>	 they aren't all like this!
[15:14:06] <razzi>	 milimetric: Yeah, it'd be great to follow the deploy
[15:17:23] <klausman>	 Well, the great thing about that is that I can always claim I know noooothing, since, well, I know nothing on the topic ;)
[15:27:36] <cdanis>	 ottomata: did you have this locally?? https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/623813
[15:27:45] <cdanis>	 it fixes the validation of examples I reported
[15:37:07] <wikibugs>	 10Analytics, 10Operations: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe) 05Open→03Resolved a:03Joe We added two additional partitions to resource_purge, and this seems to have solved the issue, mostly.
[15:56:09] <nuria>	 ottomata, klausman , razzi , elukey : can we move the SRE ops sync 30 mins earlier and we will have the sync up with PA  30 min later?
[15:57:02] <klausman>	 Like, starting next week, do Ops, PA, Standup, 30m each, starting at 17:00?
[15:58:53] <nuria>	 klausman: teh PA meeting is only every 2 weeks
[15:59:09] <klausman>	 Ah, right. So a 30m gap?
[15:59:17] <nuria>	 every other week ya
[15:59:32] <klausman>	 wfm, though I am usually not productive in such slots.
[15:59:34] <nuria>	 ottomata, klausman: is ops standup weekly?
[15:59:45] <nuria>	 klausman: ok, so more compression would be best
[15:59:53] <milimetric>	 (I'll brt for standup, just a minute over)
[15:59:53] <nuria>	 klausman: compression of meetings that is
[15:59:56] <klausman>	 Ops Sync up is weekly, yes
[16:00:28] <nuria>	 klausman: ok, i can move our 1 on 1 to that  "dangling " slot
[16:00:38] <nuria>	 klausman: for best utilization?
[16:00:52] <klausman>	 Sure, that works
[16:10:16] <wikibugs>	 10Analytics, 10Analytics-Kanban: Create new mailing list for analytics systems users - https://phabricator.wikimedia.org/T260849 (10Nuria)
[16:12:25] <wikibugs>	 10Analytics, 10VPS-Projects, 10Puppet: Puppet failing on wikistats.analytics.eqiad.wmflabs due to statistics::user - https://phabricator.wikimedia.org/T259307 (10Nuria)
[16:21:45] <wikibugs>	 10Analytics, 10VPS-Projects, 10Puppet: Puppet failing on wikistats.analytics.eqiad.wmflabs due to statistics::user - https://phabricator.wikimedia.org/T259307 (10Nuria) a:03razzi
[16:22:15] <wikibugs>	 10Analytics, 10Analytics-Kanban: Create new mailing list for analytics systems users - https://phabricator.wikimedia.org/T260849 (10Nuria) a:05Ottomata→03elukey
[16:43:51] <wikibugs>	 (03PS4) 10Ottomata: Add ProduceCanaryEvents job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/623448 (https://phabricator.wikimedia.org/T251609)
[16:46:13] <wikibugs>	 (03PS5) 10Ottomata: Add ProduceCanaryEvents job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/623448 (https://phabricator.wikimedia.org/T251609)
[16:49:56] <wikibugs>	 (03CR) 10Ottomata: Add ProduceCanaryEvents job (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/623448 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[16:50:24] <wikibugs>	 (03CR) 10Ottomata: Add ProduceCanaryEvents job (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/623448 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[16:51:28] <wikibugs>	 (03CR) 10Joal: Add ProduceCanaryEvents job (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/623448 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[16:55:03] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Add ProduceCanaryEvents job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/623448 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[16:57:42] <joal>	 thanks ottomata for accepting my nitpickyness :)
[16:59:51] <joal>	 meh - denormalize failed :(
[16:59:58] <elukey>	 joal: never seen any sign of nitpickiness from you
[17:00:06] <elukey>	 :P
[17:00:14] * elukey sends wikilove to joal
[17:00:20] <joal>	 <3
[17:00:26] <elukey>	 whatt it failed
[17:00:27] <elukey>	 sigh
[17:02:04] <milimetric>	 razzi: I gotta eat quickly and then, deploy?
[17:02:17] <milimetric>	 like maybe 20-30 min
[17:02:35] <razzi>	 milimetric: +1
[17:23:58] <wikibugs>	 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) a:05Cmjohnson→03ayounsi @ayounsi can you add the analytics vlan to cloudsw-d5 please and these 2 servers to it's v...
[17:48:04] <milimetric>	 razzi: k, sorry, ran over
[17:48:18] <milimetric>	 cave?
[17:48:25] <razzi>	 cya there
[17:57:41] * elukey afk!
[18:00:20] <nuria>	 milimetric: did you kill the mediawiki denormalize job? mediawiki-history-denormalize-coord?
[18:00:53] <milimetric>	 nuria: nope
[18:01:07] <milimetric>	 (looking)
[18:01:12] <nuria>	 milimetric: it went kaput
[18:01:21] <milimetric>	 (looking)
[18:01:22] <nuria>	 milimetric: i have a 1 on 1 now but can talk after
[18:04:55] <joal>	 I'm looking into it now milimetric, nuria 
[18:05:19] <joal>	 The error raised by spark is one I see for the first time - Meh
[18:05:22] <milimetric>	 joal: invalid operation, it's not your ops week :)
[18:06:07] <joal>	 ack milimetric - I'll let you drive - let me help along the way as you need
[18:06:47] <milimetric>	 well, I always welcome help, so I'll think out loud here on the chat
[18:11:49] <milimetric>	 joal: ok, so reading a bit, it looks like errors fetching blocks.  Some say this happens from time to time and Spark recovers by retrying, but I'm seeing a lot of retries and all failures.  Someone on stackoverflow suspects a network problem and that sounds vaguely possible with the codfw switchover, but I'm not familiar enough with which of our hadoop boxes might be in codfw (ie not kafka clusters)
[18:12:11] <milimetric>	 TransportRequestHandler  - Error sending result
[18:12:19] <milimetric>	 OneForOneBlockFetcher  - Failed while starting block fetches
[18:12:34] <milimetric>	 TransportResponseHandler  - Still have 36 requests outstanding when connection from an-worker1091.eqiad.wmnet/10.64.36.115:7337 is closed
[18:12:44] <milimetric>	 ShuffleBlockFetcherIterator  - Failed to get block(s) from an-worker1082.eqiad.wmnet:7337
[18:13:03] <joal>	 milimetric: lately I have seen more errors in spark jobs, which I think are due to more people using spark and therefore more pressure being put on the shuffle-handler
[18:13:17] <milimetric>	 hm, but this should have priority
[18:13:36] <milimetric>	 but the shuffle handler doesn't care about that, only yarn, I see
[18:13:38] <joal>	 the job ran in the wrong queue actually (probably my bad) - we should restastrt it
[18:13:42] <milimetric>	 ah!
[18:13:55] <milimetric>	 ok, I'll restart, easy first step.  If it happens again, we think more about the network
[18:14:12] <joal>	 and indeed - shuffle-handler manages stuff for all spark jobs, and if the cluster is busy with spark doing a lot reading/writing, well there is pressure
[18:14:33] <joal>	 milimetric: other interesting finding - we're hitting https://issues.apache.org/jira/browse/SPARK-23243
[18:15:22] <joal>	 milimetric: failures in stages usually recover - but in our case, spark didn't want to recompute because it says there was an indeterministic step somewhere
[18:15:37] <milimetric>	 hm...
[18:16:16] <milimetric>	 joal: but this is in the production queue, no? https://hue.wikimedia.org/oozie/list_oozie_coordinator/0000553-200720135922440-oozie-oozi-C/
[18:16:44] <joal>	 milimetric: it should, but is not - look in the Configuration
[18:16:51] <joal>	 tab, the queue_name value
[18:16:55] <milimetric>	 oh default
[18:16:59] <milimetric>	 I saw queue: production
[18:17:03] <joal>	 indeed
[18:17:10] <joal>	 but queue_name: default
[18:17:21] <joal>	 and the one used b the job is queue_name
[18:19:02] <joal>	 milimetric: I support the idea of restarting and hoping
[18:19:12] <joal>	 If it fails again I'll investigate checkpointing
[18:19:16] <joal>	 milimetric: --^
[18:19:30] <milimetric>	 ok, restarting now
[18:19:50] <joal>	 milimetric: kill restart in prod queue?
[18:20:03] <milimetric>	 yeah, that's what I was gonna do... having some problems
[18:20:15] <joal>	 np - as long as we're on the same page :)
[18:20:20] <joal>	 thanks a lot milimetric 
[18:20:24] <milimetric>	 https://www.irccloud.com/pastebin/2K6uojbI/
[18:20:32] <joal>	 reading
[18:20:53] <joal>	 milimetric: LGTM!
[18:21:42] <milimetric>	 ah, I have to kinit to run kerberos-run-command, I am fuzzy on that
[18:22:03] <milimetric>	 anyway, https://hue.wikimedia.org/oozie/list_oozie_workflow/0074295-200720135922440-oozie-oozi-W/?coordinator_job_id=0074294-200720135922440-oozie-oozi-C
[18:22:35] <joal>	 great milimetric - I'm gonne gently look at that execution (not late)
[18:22:43] <joal>	 milimetric: would you please log the action?
[18:23:51] <joal>	 for instance milimetric: there currently are 2 relatively big spark jobs running from users on the cluster - and this puts pressure
[18:24:21] <milimetric>	 !log restarting mediawiki history denormalize coordinator in production queue, due to failed 2020-08 run
[18:24:23] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:24:29] <milimetric>	 good point, thx jho
[18:24:31] <milimetric>	 *jo
[18:30:31] <wikibugs>	 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson)
[18:36:01] <nuria>	 milimetric: read backscroll, got it
[18:36:28] <milimetric>	 :thumb:
[18:36:51] * milimetric mutters something about millenials and emojis
[18:49:01] <maryum>	 hello! how can I get ssh access to the analytics boxes? I'm on the search platform team and trying to use the flink cluster in hadoop
[18:50:14] <ottomata>	 nice!
[18:50:25] <ottomata>	 maryum:  https://wikitech.wikimedia.org/wiki/Analytics/Data_access
[18:50:32] <maryum>	 thanks a ton!
[18:50:37] <ottomata>	 you'll want to submit a ticket asking for analytics-privatedata-users
[18:51:30] <maryum>	 I don't want the one for the search platform team members? analytics-search-users
[18:51:41] <maryum>	 oh wait I see
[18:51:51] <maryum>	 hadoop
[18:52:22] <ottomata>	 you probalby want that too
[18:52:29] <ottomata>	 just in case :)
[18:54:14] <maryum>	 noted
[18:59:16] <wikibugs>	 10Analytics-Radar, 10DC-Ops, 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson)
[19:02:51] <wikibugs>	 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-40] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @elukey Are you trying to re-use hostnames?  We should be using an-worker1118+
[19:04:03] <wikibugs>	 10Analytics, 10Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (10CDanis) I think that idea could be reasonable... but is it too hard to get the original XFF header out of the user request made to Turnilo, and forward that?
[19:08:37] <wikibugs>	 10Analytics, 10Operations: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Ottomata) FYI I also increased partitions to 3 for resource_change as well.
[19:11:19] <maryum>	 so I'm already in that analytics private data users group....it might just be an error on my end, but we're not supposed to use the kerberos passwords to ssh onto the machines, correct? I haven't ssh'd to an analytics machine since the kerberos changes
[19:12:30] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "Just an idea, but +1 either way." (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/623586 (https://phabricator.wikimedia.org/T237047) (owner: 10Joal)
[19:18:35] <milimetric>	 I wanna 🚂🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃 but maybe it's a bit late...
[19:19:41] <wikibugs>	 10Analytics, 10Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (10Milimetric) That would be nice.  We'd have to upstream a patch, but it would be a [[ https://github.com/allegro/turnilo/blob/8824d86d37e354baea45cff492e7e57154daab5d/src/server/routes/shorten/shor...
[19:27:39] <nuria>	 maryum: you can ssh normally
[19:27:48] <nuria>	 maryum: after you are in
[19:27:52] <nuria>	 in order to access data 
[19:27:58] <nuria>	 you need to type kinit
[19:28:15] <nuria>	 maryum: to set your kerberos credentials
[19:28:34] <nuria>	 maryum: so kerberos only keeps track of your data access
[19:28:41] <nuria>	 maryum: not your ssh keys
[19:29:52] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Technical contributors emerging communities metric definition, thick data - https://phabricator.wikimedia.org/T250284 (10Nuria) 05Open→03Resolved
[19:29:52] <wikibugs>	 10Analytics-Kanban, 10Analytics-Radar, 10Product-Analytics: Technical contributors metrics definition - https://phabricator.wikimedia.org/T247419 (10Nuria)
[19:33:56] <wikibugs>	 (03CR) 10Joal: [V: 03+1] Update drop-mediawiki-snapshots parameters and datasets (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/623586 (https://phabricator.wikimedia.org/T237047) (owner: 10Joal)
[19:34:28] <joal>	 thanks for the review ottomata - I'll try to spend a minute with Luca tomorrow thinking about the idea written in my comment-responce
[19:36:05] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Update drop-mediawiki-snapshots parameters and datasets (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/623586 (https://phabricator.wikimedia.org/T237047) (owner: 10Joal)
[19:52:04] <milimetric>	 ottomata: hey I won't have time to deploy source today, but I noticed there's no refinery job or puppet timer or anything that would use it, is that in a different patch somewhere or not yet done?
[19:52:57] <ottomata>	 not yet!
[19:53:01] <ottomata>	 will do after is deployed
[19:53:03] <ottomata>	 no thurry
[19:53:13] <ottomata>	 its a new job
[19:53:15] <ottomata>	 no puppet yet
[19:54:04] <milimetric>	 ok, cool, I'll deploy it either tonight or tomorrow morning
[20:01:42] <cdanis>	 ottomata: quick question, does/can eventgate add the client's IP address to a field?
[20:08:02] <ottomata>	 cdanis: it is set up to if the http.client_ip field is in the schema, yes, from the X-Client-IP header
[20:08:43] <cdanis>	 ah great ty
[20:08:43] <gehel>	 Pchelolo: did you create that ticket about the kafka_burrow not goign back to 0 on eqiad? I can't find it and ryankemper can't either
[20:08:54] <gehel>	 or do you have any more understanding of what's going on?
[20:10:59] <Pchelolo>	 gehel: the issue is change-prop, not burrow. https://phabricator.wikimedia.org/T261691
[20:13:42] <gehel>	 Pchelolo: cool, thanks! In the short term, is there something we should do? If I understand correctly, in the current situation, when we switch back to eqiad as main DC, we'l reprocess those 4k messages
[20:13:56] <gehel>	 in the case of Cirrus, that's not an issue
[20:14:27] <Pchelolo>	 gehel: not really reprocess them - they will be deduplicated
[20:14:41] <Pchelolo>	 so it's weird, but there's no consequences to this
[20:15:11] <gehel>	 they will have been processed in codfw and will be processed again in eqiad?
[20:53:51] <wikibugs>	 (03PS1) 10GoranSMilovanovic: minor [analytics/wmde/WD/WD_HumanEdits] - 10https://gerrit.wikimedia.org/r/623864
[20:54:06] <wikibugs>	 (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] minor [analytics/wmde/WD/WD_HumanEdits] - 10https://gerrit.wikimedia.org/r/623864 (owner: 10GoranSMilovanovic)
[20:59:33] <ottomata>	 cdanis:  sorry am in meetings and hangouts all day today!
[20:59:51] <ottomata>	 if you add the http fragment schema to your schema, eventgate-wikimdia will fill in some defaults in it
[21:00:15] <ottomata>	 https://gerrit.wikimedia.org/r/plugins/gitiles/eventgate-wikimedia/+/refs/heads/master/eventgate-wikimedia.js#361
[21:06:33] <cdanis>	 ottomata: oh, wonderful, I'll definitely do that
[21:07:40] <ottomata>	 gotta run byeeeeEEE
[21:09:10] <wikibugs>	 (03PS1) 10GoranSMilovanovic: Init [analytics/wmde/WD/WD_referenceHunt] - 10https://gerrit.wikimedia.org/r/623868
[21:09:19] <wikibugs>	 (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] Init [analytics/wmde/WD/WD_referenceHunt] - 10https://gerrit.wikimedia.org/r/623868 (owner: 10GoranSMilovanovic)
[21:18:29] <nuria>	 milimetric: did we deploy aqs?
[21:45:49] * nuria answering my own question: no