[02:11:55] 10Analytics-Radar: Presto error in Superest - only when grouping - https://phabricator.wikimedia.org/T270503 (10EYener) Hi @JAllemandou thanks for the reply! I am pulling this task back up and opened the dashboard to implement these suggestions. However, I encountered a new error on all charts: presto error: Fa... [07:18:20] good morning [07:33:06] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Collect metrics of all wikis [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/655886 (https://phabricator.wikimedia.org/T271894) (owner: 10WMDE-Fisch) [07:33:17] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) [07:33:48] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) [07:52:40] I am checking some rack availability for the new hadoop workers, and I found that in some we have more than 5 workers [07:52:46] no bueno [07:57:17] I am trying to spread nodes evenly on rows so a rack down with say 7 nodes on top shouldn't cause a ton of issues, but it is not great either [08:01:12] ah no max seems to be 6 [08:06:21] no sigh 7 in rack C4 [08:31:28] ok completed the review, overall after the recent workers addition we have [08:31:31] 19 A [08:31:33] 19 B [08:31:33] that looks very good [08:31:36] 21 C [08:31:38] 19 D [08:31:46] so the new 6 nodes can be spread anywhere [08:31:54] will comment in the task [08:38:26] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) Hi @wiki_willy thanks a lot for following up! I re-done the calculations of the workers' distribution after the last racking and this is what I g... 
[08:38:40] added my notes to --^ [09:17:10] 10Analytics, 10Performance-Team: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (10Gilles) [09:26:14] 10Analytics, 10Performance-Team, 10Patch-For-Review: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (10Gilles) Seems like coal simply needed to be restarted, it hadn't been since python3-snappy was installed on the host a few days ago for navtiming's sake. Won't hurt... [09:27:32] 10Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (10elukey) I had some thoughts about bottlenecks and the only one that came to mind, not mentioned in the description of the task, is the database. The only an-airflow... [09:36:49] 10Analytics, 10Performance-Team, 10Patch-For-Review: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (10Gilles) p:05Triage→03High [09:44:01] 10Analytics, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10elukey) [10:22:29] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, and 3 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (10hashar) That is the MediaWiki installer failing: `counterexample * A dependency error was encount... [10:26:09] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (10hashar) [10:37:39] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). 
- https://phabricator.wikimedia.org/T272863 (10hashar) The CI config change to add EventBus to the wmf-quibble* jobs is https://gerrit.wikimedia... [11:15:56] !log add client_port and debug fields to X-Analytics in webrequest varnishkafka streams [11:15:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:21:14] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (10elukey) Both changed deployed by Valentin, I checked the client_port field in webrequest_text on Kafka and it works nicely. The debug header needs to be triggered by an... [12:23:08] * elukey lunch! [12:30:01] Same. [12:42:21] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (10hashar) On the dummy change https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/6589... [13:02:39] !log Copy /wmf/data/event to backup cluster (30Tb) - T272846 [13:02:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:02:42] T272846: Backup HDFS data before BigTop upgrade - https://phabricator.wikimedia.org/T272846 [13:36:14] hey a-team, good morning/afternoon/evening . I'm having issues with the pageviews API, it works correctly from the browser, but using python I get this error: https://pastebin.pl/view/7fa0efeb [13:36:32] I'm wondering if there is any user agent issue [13:37:29] hi dsaez :] looking into this [13:38:03] thx mforns [13:38:32] dsaez: yes it is me [13:39:23] or it should be me, let's try to see :) [13:39:33] are you using python-requests? 
[13:39:52] because we added a specific block in Varnish the other day after a big surge in traffic [13:40:05] following https://meta.wikimedia.org/wiki/User-Agent_policy [13:40:20] so the block returns a 403 in this case but it should mention the UA policy [13:40:26] that I don't see in your paste [13:40:31] what is the HTTP error code returned? [13:41:09] also, can you give us the link to check? [13:41:16] 10Analytics: Add user to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10gmodena) [13:41:35] elukey: it's a 403 [13:41:43] yep 403 [13:41:56] I'm using requests [13:42:13] for example https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/de.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/daily/2015101300/2015102700 [13:42:19] that works from the browser [13:42:22] the error message mentions indeed: Scripted requests from your IP have been blocked, please see https://meta.wikimedia.org/wiki/User-Agent_policy. [13:43:39] but this: requests.get(that_url) returns the error [13:44:26] elukey, sorry, I don't get it. This is an API, so which is the expected UA? [13:45:31] dsaez: The generic format is <client name>/<version> (<contact information>) <library>/<version> [<library>/<version> ...]. Parts that are not applicable can be omitted. See: https://meta.wikimedia.org/wiki/User-Agent_policy [13:45:42] mforns: ah didn't see that yes [13:46:16] dsaez: I think you can use requests to send user agent: [13:46:20] dsaez: You'd need to provide a UA that can tell us how to contact you in case the volume of requests is big [13:46:42] response = requests.get(url, headers = {'User-agent': 'blahblah'}) [13:47:02] this is far from perfect, we may lift the block very soon (it was due to emergency) but in general we should follow the UA policy for the APIs [13:47:51] got it. Sounds very strict, I've done two calls.
[13:49:02] yes yes I know, we also have to figure out throttling, it is a temporary measure [13:49:15] but in the long term we suggest to everybody to use a proper UA [13:49:28] got it [13:49:29] in fact [13:49:33] is not blocked [13:49:57] if I add the blahblah agent, is enough [13:50:08] yes please use a better UA :D [13:50:13] hahaha [13:50:14] sure [14:02:27] 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10serviceops, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10jijiki) [14:02:30] 10Analytics, 10Analytics-Kanban, 10serviceops, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10jijiki) 05Open→03Resolved @Milimetric patch is merged! We are setting debug=1 in the X-Analytics header if "X-Wikimedia-Debug" is present. Thank you fo... [14:03:50] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (10jijiki) Debug header works, we tested it with @elukey:) [14:04:57] joal: are you around?\ [14:24:29] joal: I killed the copy (client + map-reduce job), we were causing network alarms :( [14:53:07] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) This is weird. I don't think we have encountered this before. ExecStop in the systemd unit file runs `ifdown ens5` but running that on the host returns ` root@kafka-test1006:... [14:53:39] 10Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (10Ottomata) > if we want to have two/three more Airflow instances Do we want/need this? > store a little mariadb instance on every deployment of Airflow, getting re... 
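The fix discussed above (passing a policy-compliant User-Agent to python-requests) can be sketched as follows. The client name, version, and contact address below are made-up placeholders, not values from the conversation; only the header format comes from the linked User-Agent policy.

```python
# Sketch: build a User-Agent header that follows
# https://meta.wikimedia.org/wiki/User-Agent_policy
# (client/version plus contact info, so ops can reach the operator).

def ua_headers(client, version, contact):
    """Return a headers dict identifying the client and its operator."""
    return {"User-Agent": f"{client}/{version} ({contact})"}

headers = ua_headers("my-pageviews-script", "0.1", "someone@example.org")
print(headers["User-Agent"])  # my-pageviews-script/0.1 (someone@example.org)

# With python-requests (not executed here, needs network access):
# import requests
# url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
#        "de.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/daily/"
#        "2015101300/2015102700")
# r = requests.get(url, headers=headers)  # succeeds where the bare call got a 403
```

As noted in the chat, even a minimal UA is enough to pass the block, but a real contact address is what the policy actually asks for.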
[15:00:13] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Analytics, 10Product-Infrastructure-Data: MEP: Should stream configurations be written in YAML? - https://phabricator.wikimedia.org/T269774 (10Ottomata) > Create a new repo for stream configs and add it as a git submodule to operations/mediaw... [15:14:06] 10Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (10elukey) @Ottomata the main problem that I can see how is that multi-tenancy is not really something that Airflow does well (and the people from Polidea confirmed th... [15:17:34] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) @akosiaris not reliably, but today I rebooted the 4 schema VMs and one of them got back with the same issue.. [15:24:53] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10MoritzMuehlenhoff) >>! In T273026#6780528, @akosiaris wrote: > This is weird. I don't think we have encountered this before. > > ExecStop in the systemd unit file runs `ifdown ens5` but... [15:28:31] heya elukey [15:28:37] sorry I'm with kids [15:28:41] good that you killed it [15:28:52] Let's review togother when I have time [15:30:46] ack! I just pinged if you were around, I used the hammer :D [15:38:14] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) >>! In T273026#6780640, @MoritzMuehlenhoff wrote: >>>! In T273026#6780528, @akosiaris wrote: >> This is weird. I don't think we have encountered this before. >> >> ExecStop in... 
[15:39:11] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) I recall VMs only from my past experience, I encountered this problem a couple of times before this one. [15:45:08] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10JMeybohm) >>! In T269160#6777382, @elukey wrote: > Waiting for @JMeybohm'... [15:46:24] 10Analytics, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10elukey) @Ottomata @razzi this is the first datanode disk failure after the change that I made to use facter to populate the available partitions that Yarn and HDFS can use on a given worker node. In... [15:47:22] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) >>! In T269160#6780685, @JMeybohm wrote: >>>! In T269160#6777382,... [15:47:31] 10Analytics, 10Event-Platform: Rematerialise all event schemas with enforceNumericBounds: true - https://phabricator.wikimedia.org/T273069 (10Ottomata) [15:47:37] 10Analytics, 10Event-Platform: Rematerialize all event schemas with enforceNumericBounds: true - https://phabricator.wikimedia.org/T273069 (10Ottomata) [15:49:36] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10MoritzMuehlenhoff) >>! In T273026#6780670, @akosiaris wrote: > Do you by any chance remember if it was on VMs only? Or was it physical hosts too? From my memory only VMs. I've checked my... [15:50:54] ottomata: if you are ok I'd heml eventstreams-internal! 
cd /srv/deployment-charts/helmfile.d/services/eventstreams-internal; helmfile -e codfw -i apply [15:51:33] and then eqiad [15:51:39] does it sound ok? [15:51:58] go for it! [15:51:59] yes! [15:52:10] (no lvs yet, right? [15:52:11] ) [15:52:49] not sure if i can test very easily without, would have to do some curl --resolve magic and look up lots of stuff, but if the kube logs look good we can assume it works [15:52:58] will look at logs after you apply [15:53:30] no lvs exactly [15:54:56] ok we can start with [15:54:56] Error: pods is forbidden: User "eventstreams-internal" cannot list resource "pods" in API group "" in the namespace "eventstreams-internal" [15:56:56] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) ` Error: pods is forbidden: User "eventstreams-internal" cannot l... [16:00:39] 10Analytics, 10Performance-Team: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (10Gilles) 05Open→03Resolved a:03Gilles Restarting coal fixed the data, as expected: {F34044291} [16:01:02] ah I may know why [16:01:26] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10JMeybohm) You probably have not yet deployed the admin part (the new names... [16:04:19] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) >>! In T269160#6780761, @JMeybohm wrote: > You probably have not...
[16:06:48] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10JMeybohm) Apart from you testing my attention again (kube_env admin [codf... [16:13:09] (03PS1) 10Mforns: Make HiveToDruid return exit code when deployMode=client [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/659017 (https://phabricator.wikimedia.org/T271568) [16:14:23] elukey: should we start referring to presto as trino? [16:14:34] was thinking about adding presto support to wmfdata python [16:14:40] looked for a client [16:14:41] https://github.com/trinodb/trino-python-client [16:14:44] looks like the one maybe [16:15:19] ottomata: to avoid too much work, I'd just upgrade to the latest presto (fb presto) and then think about migrating to trino later [16:15:31] I thought we agreed on this during a standup :D [16:19:06] 10Analytics, 10Patch-For-Review: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues - https://phabricator.wikimedia.org/T271568 (10mforns) After some tests, I think the problem lies in the code: ` if (spark.conf.get("spark.master") != "yarn") { sys.exit(if (su... [16:19:45] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (10Ottomata) [16:19:50] elukey: my memory is poor [16:20:09] if I did presto in wmfdata then, should I use https://github.com/prestodb/presto-python-client instead?
[16:20:30] trino one has more recent commits [16:20:52] yes that client should be ok in my opinion [16:21:19] to clarify - if we want to move to trino I am 100% onboard, it seemed only too much for us [16:21:38] but if you want to move to trino +1 [16:23:04] elukey: naw i'm not trying to expedite move to it [16:23:12] just wondering what our language should be, but [16:23:21] it sounds like for my q: we should keep saying 'presto' [16:23:27] i can use a trino client now [16:23:38] and later when we change rename to 'trino' in wmfdata [16:23:40] e.g. ^ [16:32:48] Here I am [16:33:18] elukey: Hi :) [16:33:29] elukey: I'm sorry again about the network mess :( [16:36:00] razzi: Hello :) would you have a minute for me? [16:47:52] joal: not your fault :) [16:48:00] I wondered :S [16:48:06] :) [16:48:10] no I mean it was the data copy [16:48:14] I probably shouldn't be back :) [16:48:19] but you didn't really do it on purpose [16:48:20] It was elukey [16:48:23] so not your fault :) [16:48:25] well, I did! [16:48:35] uffff [16:48:47] I strongly disagree :D [16:49:00] We knew it would put load on the network - We just didn't know how much and how much was too much :) [16:49:01] but I cannot really convince you otherwise :D [16:49:04] hehehe :) [16:49:27] anyway - Shall I try with half the number of mappers? [16:50:47] elukey: --^ [16:51:38] (03CR) 10Elukey: [C: 03+1] "Completely ignorant about this but the option looks present for 2.4 and it makes sense to me, thanks Marcel!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/659017 (https://phabricator.wikimedia.org/T271568) (owner: 10Mforns) [16:52:14] thanks for the CR elukey :] [16:53:38] joal: yes let's try! [16:53:40] (03CR) 10Joal: [C: 03+1] "LGTM!Thanks @mforns" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/659017 (https://phabricator.wikimedia.org/T271568) (owner: 10Mforns) [16:53:46] ack elukey - launching the thing [16:53:51] joal: is there a way to throttle it a bit too? 
[16:54:17] thx for CR joal, do you know why we are not returning exit code inside YARN? [16:55:28] mforns: I imagine we could, but there would be no way to actually take advantage of it I think [16:56:26] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) es-internal deployed in both eqiad and codfw, next steps are: -... [16:56:37] joal: aha [16:58:30] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (10mforns) [16:58:40] 10Analytics: Filter out webrequest where debug=1 from pageview - https://phabricator.wikimedia.org/T273083 (10JAllemandou) [17:00:11] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) @elukey [[ https://logstash.wikimedia.org/goto/b408da9f4b39f66a... [17:01:58] fdans: milimetric joal yoohoo! [17:02:04] elukey: file-listing done, actual copy starting [17:02:12] ottomata: tuning-session! [17:02:23] oh ho ok [17:02:53] elukey: 8.8M files to be copied [17:03:33] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10Legoktm) p:05Triage→03Low [17:04:31] elukey: I also have a question when ou have a minute [17:05:43] joal: ping standup? [17:05:58] mforns: tuning session? shall I maybe not be there? [17:06:06] fdans: --^ ? [17:06:12] oh! [17:10:29] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) I 'll take your word for it. +1 on the cleanup thing. 
[17:13:40] Amir1: the client_port flag is now in new webrequest data, so if you need to check/use it you can :) [17:13:50] what is the ideal use case? Query via Superset? [17:13:55] or do you use hive via cli? [17:13:59] (or even presto) [17:14:00] Awesome [17:14:05] I do hive [17:14:07] beeline [17:14:19] perfect [17:14:23] Amir1: I suggest you try spark ;) [17:14:29] I need to ask the cu in ukwiki [17:14:46] usually yes but this one is a specific problem :D [17:15:03] Thank you! [17:15:28] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (10Ottomata) a:05Gilles→03Ottomata [17:18:15] elukey: I just restarted the copy job - I realized I messed up and had not changed the number of mappers :( [17:19:12] https://phabricator.wikimedia.org/T265692#6781099 let the CU know [17:27:08] 10Analytics, 10SRE: archiva artifact links point to 127.0.0.1 - https://phabricator.wikimedia.org/T164993 (10elukey) [17:34:50] razzi: not sure if you got my previous ping with the irc issues - trying again [17:35:37] joal: didn't see the ping, please go again :) [17:35:43] Hi razzi :) [17:35:49] I have a question [17:35:53] if you have a minute [17:36:07] indeed I do [17:36:28] razzi: Can you confirm that user eyener is in analytics-privatedata-users group? [17:36:50] I think elukey told me 10 times how to do it, and I still can't recall :( [17:37:59] joal: I can confirm that user is in analytics-privatedata-users by running `groups eyener` [17:38:54] ack razzi - I wouldn't have expected I can run the groups command as not-root - Thanks a lot!! [17:39:19] you're welcome :) [17:56:23] eyener: Hi! I'm reading your comment on the presto error ticket [18:03:03] razzi: do you want to reboot an-launcher1002? [18:03:36] elukey: yeah, bc?
[18:03:58] elukey: or maybe it's not that involved and we can do so async [18:04:09] razzi: I think that we can do it in here if you are ok [18:05:57] razzi: to recap - first thing is to check what's running with 'systemctl list-timers' [18:06:33] we have to identify the prefixes to stop [18:06:44] ah also, let's disable puppet [18:06:58] with something like "Razzi - prepping for reboot" [18:07:06] elukey: sounds good [18:07:13] one first example could be [18:07:27] sudo systemctl stop 'reportupdater-*.timer' [18:07:41] the important bit here is the .timer at the end [18:07:59] since if you do stop reportupdater-* you'll target the service, that might be running [18:08:06] we want to stop scheduled executions [18:08:16] (and basically gently draining) [18:08:41] eventually you'll end up with systemctl list-timers showing only system level timers [18:08:44] like logrotate etc.. [18:08:47] that are fine to run [18:09:04] once done, we'll need to check if any java/python processes are running [18:09:20] if yes, let's wait until they finish, otherwise green light to reboot [18:09:33] then puppet enable + run and the maintenance is done :) [18:09:35] I don't see reportupdater- timers in systemctl list-timers [18:09:49] Wed 2021-01-27 19:00:00 UTC 53min left Wed 2021-01-27 18:00:00 UTC 6min ago reportupdater-browser.timer [18:10:03] on what host are you? [18:10:09] :) an-master oops [18:10:18] ah yes it makes sense then :D [18:10:35] Reenabled puppet, now on to an-launcher1002 [18:10:51] 10Analytics-Radar: Presto error in Superest - only when grouping - https://phabricator.wikimedia.org/T270503 (10JAllemandou) Hi @EYener > presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event_sanitized/CentralNoticeBannerHistory/year=2021/month=1/day=9/hour=21 I have not experienced th... [18:12:42] elukey: are we still ok in terms of network?
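The reboot-prep walkthrough above can be summarized as an ordered checklist. The sketch below only *builds* the shell commands as strings (a dry run); it executes nothing. The timer globs are examples taken from the conversation, and the puppet enable/disable/run invocations are standard `puppet agent` CLI calls assumed here, not commands quoted from the chat.

```python
# Dry-run sketch of the an-launcher drain procedure described above.
# Nothing is executed; the function just returns the commands in order.

def drain_commands(timer_globs, reason="Razzi - prepping for reboot"):
    """Commands for gently draining a host before reboot."""
    cmds = [f'sudo puppet agent --disable "{reason}"']
    for glob in timer_globs:
        # The trailing '.timer' matters: stopping 'foo-*' alone would also
        # hit foo-* *services* that may be mid-run. We only stop the
        # scheduling, then let in-flight jobs finish on their own.
        cmds.append(f"sudo systemctl stop '{glob}.timer'")
    cmds += [
        "systemctl list-timers",    # only system timers (logrotate, ...) should remain
        "pgrep -af 'java|python'",  # wait until no analytics jobs are running
        "sudo reboot",
        "sudo puppet agent --enable",
        "sudo puppet agent -t",     # the puppet run restores all puppet-defined timers
    ]
    return cmds

for c in drain_commands(["reportupdater-*", "hdfs-cleaner-*",
                         "mediawiki-*", "hdfs-balancer"]):
    print(c)
```

The design point from the chat: stop timers (the schedules), not services (the running jobs), and only reboot once `list-timers` and the process check are both clean.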
[18:14:36] joal: it seems so yes, no complaints about link saturation [18:14:46] ack elukey - thanks for checking [18:15:02] elukey: something else if you may? [18:15:58] joal: sure what's up [18:16:44] elukey: we're gonna need all users setup on the backup cluster :( [18:17:16] elukey: the /user folder is looking wrong despite me having resynced [18:17:24] in terms of ownership [18:17:36] joal: you wiped and re-copied right? [18:17:52] elukey: I distcp -update - which should do the same [18:18:53] joal Awesome! You fixed it! :) I am not sure what the issue was but every chart in that dash was failing to load yesterday [18:19:20] eyener: eh :) Fixing without touching is my preferred way - usually doesn't work though :) [18:20:02] thanks for letting me know eyener - sorry for no good answer on updating charts (yet) [18:21:20] Ha no worries joal - appreciate you checking it out. I've asked in the Superset slack workspace as well and haven't received a reply but I'll let you know if I ever figure it out [18:21:29] maybe some jinja templating or something...? [18:22:11] very possible eyener - /me is no superset ninja for sure [18:22:13] joal: not sure, have you tried to explicitly wipe and copy a single user dir? Just to see if perms are weird [18:22:28] in theory users are already deployed on the cluster, on all nodes [18:22:31] masters + workers [18:22:35] MQH [18:22:37] MEH [18:22:50] 10Analytics, 10Product-Infrastructure-Data, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (10Ottomata) In a meeting with devs doing client error logging today, we realized that conf...
[18:23:06] elukey: I'll try wipe-out for real and see if it changes anything [18:23:29] elukey: and I'll use 64 mappers as my basis [18:24:01] perfect thanks [18:24:11] if it doesn't work we can check again but it is weird [18:24:16] sure elukey [18:24:33] thanks for confirming that the hardware should be ready [18:25:15] joal, I believe the changes you did to hdfs cleaner need to be deployed? [18:25:40] mforns: I think elukey did? [18:25:47] maybe not? [18:26:01] joal: isn't the hdfs cleaner in refinery repo? [18:26:20] mforns: yep the three timers have been deployed [18:26:34] ok elukey thanks [18:26:38] mforns: I have not changed the code - only added puppet stuff :) [18:26:46] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) Hi @elukey - thanks for the mapping. What makes it tough is that the remaining 6x hosts need to be on 10g switches, which really limits our op... [18:26:46] ok ok :] [18:26:54] thanks for checking mforns [18:31:52] mforns: in theory I should be on-call now right? Anything to handover? [18:31:59] forgot to ask during standup [18:32:06] razzi: how are things going? [18:32:15] elukey: no no, it's tomorrow [18:32:33] elukey: good, have stopped some more timers, still going through the list [18:32:45] okok [18:36:47] I believe the following services should be kept, am I missing any? [18:36:47] export_smart_data_dump.service [18:36:47] logrotate.service [18:36:47] man-db.service [18:36:47] systemd-tmpfiles-clean.service [18:38:19] oh and apt-daily.service and apt-daily-upgrade.service [18:38:30] yes yes [18:38:45] the only one that you missed is the hdfs-cleaner-* [18:38:59] those are the periodical jobs that clean up some dirs in hdfs [18:40:29] razzi: --^ [18:41:36] cool [18:45:34] razzi: can you stop them? 
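The distcp tuning discussed above (halving the mapper count, and the earlier question about throttling) can be sketched as command construction. `-update`, `-m`, and `-bandwidth` are standard Hadoop DistCp options; the source/destination paths and the numbers here are illustrative, and the backup-cluster URI is a hypothetical placeholder, not the real one from the chat.

```python
# Sketch: build a hadoop distcp invocation with throttling knobs.
# -m caps the number of concurrent map tasks; -bandwidth caps MB/s
# *per map*, so aggregate throughput ~= mappers * bandwidth_mb.

def distcp_command(src, dst, mappers=64, bandwidth_mb=None, update=True):
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")  # skip files already copied, so reruns resume
    cmd += ["-m", str(mappers)]
    if bandwidth_mb is not None:
        cmd += ["-bandwidth", str(bandwidth_mb)]
    cmd += [src, dst]
    return cmd

print(" ".join(distcp_command(
    "hdfs://analytics-hadoop/wmf/data/event",
    "hdfs://backup-cluster/wmf/data/event",  # placeholder destination
    mappers=64, bandwidth_mb=50)))
```

Lowering `-m` (as joal did, 64 as the new basis) reduces parallelism; `-bandwidth` would additionally bound each map's network usage, which is one way to avoid the link-saturation alarms mentioned earlier.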
[18:45:45] so we can proceed with the next steps :) [18:46:08] yes yes, got distracted [18:47:51] razzi: also there are mediawiki-* and hdfs-balancer [18:48:30] we should really think about changing the names, adding something like analytics- in front [18:48:34] How about prometheus-nic-firmware-textfile / prometheus_intel_microcode? [18:49:17] those are fine, the prometheus exporters can be left aside [18:49:23] they just expose metrics [18:49:30] ok cool [18:50:00] then we need to make sure that no java/python processes are running, and if so we'd need to wait [18:50:33] so wait, should hdfs-balancer and mediawiki* be stopped? [18:54:05] yep yep [18:54:16] those don't need to run while we reboot [18:55:50] ok should be all set to reboot [18:57:10] razzi: what about java/python processes running? [18:57:16] oh right [18:57:38] also you didn't stop the hdfs-cleaner timers [18:58:21] ottomata: not sure if I need a +1 for these, but just in case, can you look? :] [18:58:23] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/659022 [18:58:28] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/658426 [19:00:15] elukey: alright, stopped [19:01:59] We would like to run a spark job that downloads all commons images from swift and stores the base64 image bytes in a column on hdfs, there will be roughly 7TB of data. Is there a recommended folder to store such a dataset, ie so that the size will not cause problems and it is available for others on the team? [19:04:28] fkaelin: I suspect that you are working with Miriam :D [19:05:34] fkaelin: so there are a couple of things to check - how many files are we talking about? (the hdfs namenode suffers a bit when we add a million more files etc..) [19:05:45] I am more concerned about that than those 7TB of space [19:06:04] razzi: so next steps?
:) [19:06:31] I see a couple of python processes: eventlogging_to_druid_navigationtiming_hourly and eventlogging_to_druid_navigationtiming_daily [19:06:31] and a couple java ones org.wikimedia.analytics.refinery.job.HiveToDruid [19:06:31] No idea how long they'll take to finish [19:06:47] razzi: perfect [19:06:54] one thing to check is when they started [19:07:02] one on Jan25 [19:07:13] the other on Jan21 [19:07:30] or not sorry lemme check better [19:07:34] I might say something silly [19:08:20] mmm yes weird they have been running for a while [19:08:50] mforns: holaaaaa [19:08:55] do you have a min? [19:09:36] the navtiming hourly + daily hive2druid indexations seem to be taking a lot of time, they started hours and hours ago [19:09:45] has it ever happened that they got stuck? [19:10:17] elukey yes, that is work with miriam. the image bytes will be stored as base64 encoded strings in a schema, so the number of files depends on whatever blocksize hadoop/spark chooses [19:11:39] fkaelin: okok so 7TB is a bit but we have a lot of space, and it is a one off, the only thing that we should check is how many files will be generated.. if it is say 10 millions it might be a problem, if we are talking about a few thousand I think it is fine [19:12:20] fkaelin: can we run a test on a subset of data to see how many files are generated? [19:13:26] our blocksize for hadoop is 256M IIRC [19:14:06] razzi: since we cannot leave things stopped for so long, let's reboot an-launcher1002 [19:14:15] those two jobs seem stuck [19:14:25] (we need to downtime first) [19:19:50] razzi: I am rebooting the node myself, we should not wait this long [19:20:13] we stopped camus for a long time and when it restarts it lags for a while [19:20:26] so when doing maintenance let's focus on the task please :) [19:21:37] elukey: alright yeah [19:24:49] elukey: in meeting! it finishes in 25mins [19:25:03] mforns: all good!
We can follow up tomorrow [19:25:27] elukey: but yes, it happened start of the year! [19:25:42] sigh :( [19:25:55] razzi: ok host is up, can you re-enable and run puppet? [19:27:16] elukey: re-enable timers via systemctl start? [19:27:35] razzi: a puppet run is sufficient to restore all puppet-defined timers [19:27:50] gotcha, that makes sense [19:27:56] sorry mforns looks like you got em +ed :) [19:28:12] ottomata: yes, no problemo, they deployed :] [19:29:04] razzi: hm i did migrate a bunch of navigationtiming data to event platform today! [19:29:11] i wouldn't expect it to cause issues [19:29:13] but..would it? [19:29:32] mforns: can you think of anything in HiveToDruid that would need to be changed to deal with events with migrated schema? [19:29:37] the hive table was migrated yesterday [19:29:39] one job was stuck since the 21st :( [19:29:46] the other from the 25th [19:29:55] ottomata: in a meeting, but will respond in a bit! [19:30:13] hm yeah i didn't touch navigation timing until yesterday [19:30:15] also meeting! :) [19:30:24] all right I am going to dinner, ttl! [19:30:29] cya elukey [19:31:07] l8rs [19:31:23] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics, 10Product-Infrastructure-Data: Roll-up raw sessionTick data into distribution - https://phabricator.wikimedia.org/T271455 (10sdkim) a:05mforns→03Mayakp.wiki [19:37:22] * razzi afk for lunch [20:02:37] gone for tonight team - see you tomorrow [20:09:03] elukey for the tests I used the default blocksize which seems to be 64MB. So for 7TB of data we are looking at ~100k files, or ~25k if we set the blocksize to 256MB. [20:12:26] elukey the job will run over a couple days on a small number of workers (aiming for ~100qps to swift), so the hdfs files will be created at a slow pace.
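The file-count estimates quoted above check out if we assume roughly one output file per HDFS block (a simplification: the real count depends on how Spark partitions the output, not on `dfs.blocksize` alone):

```python
# Back-of-the-envelope check of the "~100k files at 64MB blocks,
# ~25k at 256MB" figures from the conversation, for 7TB of data.
TB = 1024 ** 4
MB = 1024 ** 2

def approx_files(total_bytes, block_bytes):
    # ceiling division: partially filled last block still counts
    return -(-total_bytes // block_bytes)

print(approx_files(7 * TB, 64 * MB))   # 114688 -> the "~100k files" figure
print(approx_files(7 * TB, 256 * MB))  # 28672  -> the "~25k" figure
```

At the cluster's 256MB `dfs.blocksize` (confirmed later via yarn.wikimedia.org/conf), the dataset lands well under the "10 million files" namenode-pressure threshold elukey was worried about.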
[20:27:16] 10Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (10EBernhardson) Concur with regard to multi-tenancy, I tried to setup our airflow initially in a way that used the builtin multi-tenancy but as soon as I started inte... [20:50:34] ottomata: I don't see anything that would need to be changed for HiveToDruid re. migrated schemas... [20:50:58] ottomata: maybe the only thing would be the meaning of the dt field [20:51:22] but IIUC the meaning of dt does not change right? [20:52:28] and all other fields are available with the same name in a backwards compatible way... so, I'd say no changes needed [20:54:26] joal if you're around, I'm getting another iteration of the `presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event_sanitized/CentralNoticeBannerHistory/year=2021/month=1/day=9/hour=1` error when I try to edit the Banner History dash [20:57:45] (03PS1) 10Mforns: Add en.wikidata to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/659081 [20:59:58] mforns: no it's the same with legacy data [21:00:07] dt only means event time for new schemas [21:02:49] I see ottomata, HiveToDruid will work for new schemas the same, the only difference (if we want to use a time field other than dt for a given dataset) would be we have to explicitly specify it from druid_load.pp (which is already supported) [21:03:03] cool!
[21:08:05] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (10mforns) [21:08:49] 10Analytics, 10Analytics-Kanban, 10Event-Platform: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 (10mforns) [21:09:00] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Patch-For-Review: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 (10mforns) [21:15:58] 10Analytics, 10Better Use Of Data: Create Oozie job for session length - https://phabricator.wikimedia.org/T273116 (10mforns) [21:33:53] fkaelin: default block size should be 256MB [21:33:55] https://yarn.wikimedia.org/conf [21:33:59] dfs.blocksize [21:38:43] mforns: tomorrow my morning i'm going to migrate my nav timing schemas to all wikis, if you are around we can do yours at the same time (without a deployment window)