2021-01-27 02:11:55
|
<wikibugs>
|
'Analytics-Radar: Presto error in Superset - only when grouping - https://phabricator.wikimedia.org/T270503 (''EYener) Hi @JAllemandou thanks for the reply! I am pulling this task back up and opened the dashboard to implement these suggestions. However, I encountered a new error on all charts: presto error: Fa...'
|
2021-01-27 07:18:20
|
<elukey>
|
good morning
|
2021-01-27 07:33:06
|
<wikibugs>
|
('CR) ''Thiemo Kreuz (WMDE): [C: ''+1] Collect metrics of all wikis [analytics/reportupdater-queries] - ''https://gerrit.wikimedia.org/r/655886 (https://phabricator.wikimedia.org/T271894) (owner: ''WMDE-Fisch)'
|
2021-01-27 07:33:17
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''elukey)'
|
2021-01-27 07:33:48
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''elukey)'
|
2021-01-27 07:52:40
|
<elukey>
|
I am checking rack availability for the new hadoop workers, and I found that in some racks we have more than 5 workers
|
2021-01-27 07:52:46
|
<elukey>
|
no bueno
|
2021-01-27 07:57:17
|
<elukey>
|
I am trying to spread nodes evenly across rows, so a rack going down with say 7 nodes on top shouldn't cause a ton of issues, but it is not great either
|
2021-01-27 08:01:12
|
<elukey>
|
ah no max seems to be 6
|
2021-01-27 08:06:21
|
<elukey>
|
no sigh 7 in rack C4
|
2021-01-27 08:31:28
|
<elukey>
|
ok completed the review, overall after the recent workers addition we have
|
2021-01-27 08:31:31
|
<elukey>
|
19 A
|
2021-01-27 08:31:33
|
<elukey>
|
19 B
|
2021-01-27 08:31:33
|
<elukey>
|
that looks very good
|
2021-01-27 08:31:36
|
<elukey>
|
21 C
|
2021-01-27 08:31:38
|
<elukey>
|
19 D
|
2021-01-27 08:31:46
|
<elukey>
|
so the new 6 nodes can be spread anywhere
|
2021-01-27 08:31:54
|
<elukey>
|
will comment in the task
|
2021-01-27 08:38:26
|
<wikibugs>
|
'Analytics-Clusters, ''DC-Ops, ''SRE, ''ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (''elukey) Hi @wiki_willy thanks a lot for following up! I redid the calculations of the workers' distribution after the last racking and this is what I g...'
|
2021-01-27 08:38:40
|
<elukey>
|
added my notes to --^
|
2021-01-27 09:17:10
|
<wikibugs>
|
'Analytics, ''Performance-Team: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (''Gilles)'
|
2021-01-27 09:26:14
|
<wikibugs>
|
'Analytics, ''Performance-Team, ''Patch-For-Review: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (''Gilles) Seems like coal simply needed to be restarted, it hadn't been since python3-snappy was installed on the host a few days ago for navtiming's sake. Won't hurt...'
|
2021-01-27 09:27:32
|
<wikibugs>
|
'Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (''elukey) I had some thoughts about bottlenecks and the only one that came to mind, not mentioned in the description of the task, is the database. The only an-airflow...'
|
2021-01-27 09:36:49
|
<wikibugs>
|
'Analytics, ''Performance-Team, ''Patch-For-Review: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (''Gilles) p:''Triage→''High'
|
2021-01-27 09:44:01
|
<wikibugs>
|
'Analytics, ''SRE, ''ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (''elukey)'
|
2021-01-27 10:22:29
|
<wikibugs>
|
'Analytics, ''Better Use Of Data, ''Event-Platform, ''Product-Infrastructure-Data, and 3 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (''hashar) That is the MediaWiki installer failing: `counterexample * A dependency error was encount...'
|
2021-01-27 10:26:09
|
<wikibugs>
|
'Analytics, ''Better Use Of Data, ''Event-Platform, ''Product-Infrastructure-Data, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (''hashar)'
|
2021-01-27 10:37:39
|
<wikibugs>
|
'Analytics, ''Better Use Of Data, ''Event-Platform, ''Product-Infrastructure-Data, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (''hashar) The CI config change to add EventBus to the wmf-quibble* jobs is https://gerrit.wikimedia...'
|
2021-01-27 11:15:56
|
<elukey>
|
!log add client_port and debug fields to X-Analytics in webrequest varnishkafka streams
|
2021-01-27 11:15:58
|
<stashbot>
|
Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
|
2021-01-27 11:21:14
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (''elukey) Both changes deployed by Valentin, I checked the client_port field in webrequest_text on Kafka and it works nicely. The debug header needs to be triggered by an...'
|
2021-01-27 12:23:08
|
<elukey>
|
lunch!
|
2021-01-27 12:30:01
|
<klausman>
|
Same.
|
2021-01-27 12:42:21
|
<wikibugs>
|
'Analytics, ''Better Use Of Data, ''Event-Platform, ''Product-Infrastructure-Data, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (''hashar) On the dummy change https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/6589...'
|
2021-01-27 13:02:39
|
<joal>
|
!log Copy /wmf/data/event to backup cluster (30TB) - T272846
|
2021-01-27 13:02:41
|
<stashbot>
|
Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
|
2021-01-27 13:02:42
|
<stashbot>
|
T272846: Backup HDFS data before BigTop upgrade - https://phabricator.wikimedia.org/T272846
|
2021-01-27 13:36:14
|
<dsaez>
|
hey a-team, good morning/afternoon/evening . I'm having issues with the pageviews API, it works correctly from the browser, but using python I get this error: https://pastebin.pl/view/7fa0efeb
|
2021-01-27 13:36:32
|
<dsaez>
|
I'm wondering if there is any user agent issue
|
2021-01-27 13:37:29
|
<mforns>
|
hi dsaez :] looking into this
|
2021-01-27 13:38:03
|
<dsaez>
|
thx mforns
|
2021-01-27 13:38:32
|
<elukey>
|
dsaez: yes it is me
|
2021-01-27 13:39:23
|
<elukey>
|
or it should be me, let's try to see :)
|
2021-01-27 13:39:33
|
<elukey>
|
are you using python-requests?
|
2021-01-27 13:39:52
|
<elukey>
|
because we added a specific block in Varnish the other day after a big surge in traffic
|
2021-01-27 13:40:05
|
<elukey>
|
following https://meta.wikimedia.org/wiki/User-Agent_policy
|
2021-01-27 13:40:20
|
<elukey>
|
so the block returns a 403 in this case but it should mention the UA policy
|
2021-01-27 13:40:26
|
<elukey>
|
that I don't see in your paste
|
2021-01-27 13:40:31
|
<elukey>
|
what is the HTTP error code returned?
|
2021-01-27 13:41:09
|
<elukey>
|
also, can you give us the link to check?
|
2021-01-27 13:41:16
|
<wikibugs>
|
'Analytics: Add user to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (''gmodena)'
|
2021-01-27 13:41:35
|
<mforns>
|
elukey: it's a 403
|
2021-01-27 13:41:43
|
<dsaez>
|
yep 403
|
2021-01-27 13:41:56
|
<dsaez>
|
I'm using requests
|
2021-01-27 13:42:13
|
<dsaez>
|
for example https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/de.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/daily/2015101300/2015102700
|
2021-01-27 13:42:19
|
<dsaez>
|
that works from the browser
|
2021-01-27 13:42:22
|
<mforns>
|
indeed, the error message mentions: Scripted requests from your IP have been blocked, please see https://meta.wikimedia.org/wiki/User-Agent_policy.
|
2021-01-27 13:43:39
|
<dsaez>
|
but this: requests.get(that_url) returns the error
|
2021-01-27 13:44:26
|
<dsaez>
|
elukey, sorry, I don't get it. This is an API, so what is the expected UA?
|
2021-01-27 13:45:31
|
<mforns>
|
dsaez: The generic format is <client name>/<version> (<contact information>) <library/framework name>/<version> [<library name>/<version> ...]. Parts that are not applicable can be omitted. See: https://meta.wikimedia.org/wiki/User-Agent_policy
|
2021-01-27 13:45:42
|
<elukey>
|
mforns: ah didn't see that yes
|
2021-01-27 13:46:16
|
<mforns>
|
dsaez: I think you can use requests to send user agent:
|
2021-01-27 13:46:20
|
<elukey>
|
dsaez: You'd need to provide a UA that can tell us how to contact you in case the volume of requests is big
|
2021-01-27 13:46:42
|
<mforns>
|
response = requests.get(url, headers = {'User-agent': 'blahblah'})
|
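A minimal sketch of what a policy-compliant call could look like, building on mforns's snippet above; the client name and contact address below are placeholders for illustration, not values from this conversation:

```python
import requests

# Hypothetical identifier following the format described at
# https://meta.wikimedia.org/wiki/User-Agent_policy:
#   <client name>/<version> (<contact information>) <library>/<version>
HEADERS = {
    "User-Agent": "my-pageviews-tool/0.1 (analytics-user@example.org) python-requests/2.25"
}

# The per-article pageviews URL dsaez was querying.
url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "de.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/daily/"
    "2015101300/2015102700"
)

response = requests.get(url, headers=HEADERS)
response.raise_for_status()  # a 403 here would indicate the UA block
print(response.json())
```

|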
2021-01-27 13:47:02
|
<elukey>
|
this is far from perfect, we may lift the block very soon (it was due to an emergency) but in general we should follow the UA policy for the APIs
|
2021-01-27 13:47:51
|
<dsaez>
|
got it. Sounds very strict, I've done two calls.
|
2021-01-27 13:49:02
|
<elukey>
|
yes yes I know, we also have to figure out throttling, it is a temporary measure
|
2021-01-27 13:49:15
|
<elukey>
|
but in the long term we suggest to everybody to use a proper UA
|
2021-01-27 13:49:28
|
<dsaez>
|
got it
|
2021-01-27 13:49:29
|
<dsaez>
|
in fact
|
2021-01-27 13:49:33
|
<dsaez>
|
it's not blocked
|
2021-01-27 13:49:57
|
<dsaez>
|
if I add the blahblah agent, it's enough
|
2021-01-27 13:50:08
|
<elukey>
|
yes please use a better UA :D
|
2021-01-27 13:50:13
|
<dsaez>
|
hahaha
|
2021-01-27 13:50:14
|
<dsaez>
|
sure
|
2021-01-27 14:02:27
|
<wikibugs>
|
'Analytics-Radar, ''Release-Engineering-Team, ''observability, ''serviceops, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (''jijiki)'
|
2021-01-27 14:02:30
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''serviceops, ''User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (''jijiki) ''Open→''Resolved @Milimetric patch is merged! We are setting debug=1 in the X-Analytics header if "X-Wikimedia-Debug" is present. Thank you fo...'
|
2021-01-27 14:03:50
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (''jijiki) Debug header works, we tested it with @elukey:)'
|
2021-01-27 14:04:57
|
<elukey>
|
joal: are you around?
|
2021-01-27 14:24:29
|
<elukey>
|
joal: I killed the copy (client + map-reduce job), we were causing network alarms :(
|
2021-01-27 14:53:07
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''akosiaris) This is weird. I don't think we have encountered this before. ExecStop in the systemd unit file runs `ifdown ens5` but running that on the host returns ` root@kafka-test1006:...'
|
2021-01-27 14:53:39
|
<wikibugs>
|
'Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (''Ottomata) > if we want to have two/three more Airflow instances Do we want/need this? > store a little mariadb instance on every deployment of Airflow, getting re...'
|
2021-01-27 15:00:13
|
<wikibugs>
|
'Analytics, ''Better Use Of Data, ''Event-Platform, ''Product-Analytics, ''Product-Infrastructure-Data: MEP: Should stream configurations be written in YAML? - https://phabricator.wikimedia.org/T269774 (''Ottomata) > Create a new repo for stream configs and add it as a git submodule to operations/mediaw...'
|
2021-01-27 15:14:06
|
<wikibugs>
|
'Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (''elukey) @Ottomata the main problem that I can see now is that multi-tenancy is not really something that Airflow does well (and the people from Polidea confirmed th...'
|
2021-01-27 15:17:34
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''elukey) @akosiaris not reliably, but today I rebooted the 4 schema VMs and one of them got back with the same issue..'
|
2021-01-27 15:24:53
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''MoritzMuehlenhoff) >>! In T273026#6780528, @akosiaris wrote: > This is weird. I don't think we have encountered this before. > > ExecStop in the systemd unit file runs `ifdown ens5` but...'
|
2021-01-27 15:28:31
|
<joal>
|
heya elukey
|
2021-01-27 15:28:37
|
<joal>
|
sorry I'm with kids
|
2021-01-27 15:28:41
|
<joal>
|
good that you killed it
|
2021-01-27 15:28:52
|
<joal>
|
Let's review togother when I have time
|
2021-01-27 15:30:46
|
<elukey>
|
ack! I just pinged to check if you were around, I used the hammer :D
|
2021-01-27 15:38:14
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''akosiaris) >>! In T273026#6780640, @MoritzMuehlenhoff wrote: >>>! In T273026#6780528, @akosiaris wrote: >> This is weird. I don't think we have encountered this before. >> >> ExecStop in...'
|
2021-01-27 15:39:11
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''elukey) I recall VMs only from my past experience, I encountered this problem a couple of times before this one.'
|
2021-01-27 15:45:08
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform, ''EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (''JMeybohm) >>! In T269160#6777382, @elukey wrote: > Waiting for @JMeybohm'...'
|
2021-01-27 15:46:24
|
<wikibugs>
|
'Analytics, ''SRE, ''ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (''elukey) @Ottomata @razzi this is the first datanode disk failure after the change that I made to use facter to populate the available partitions that Yarn and HDFS can use on a given worker node. In...'
|
2021-01-27 15:47:22
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform, ''EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (''elukey) >>! In T269160#6780685, @JMeybohm wrote: >>>! In T269160#6777382,...'
|
2021-01-27 15:47:31
|
<wikibugs>
|
'Analytics, ''Event-Platform: Rematerialise all event schemas with enforceNumericBounds: true - https://phabricator.wikimedia.org/T273069 (''Ottomata)'
|
2021-01-27 15:47:37
|
<wikibugs>
|
'Analytics, ''Event-Platform: Rematerialize all event schemas with enforceNumericBounds: true - https://phabricator.wikimedia.org/T273069 (''Ottomata)'
|
2021-01-27 15:49:36
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''MoritzMuehlenhoff) >>! In T273026#6780670, @akosiaris wrote: > Do you by any chance remember if it was on VMs only? Or was it physical hosts too? From my memory only VMs. I've checked my...'
|
2021-01-27 15:50:54
|
<elukey>
|
ottomata: if you are ok I'd helm eventstreams-internal!
|
2021-01-27 15:51:30
|
<elukey>
|
cd /srv/deployment-charts/helmfile.d/services/eventstreams-internal; helmfile -e codfw -i apply
|
2021-01-27 15:51:33
|
<elukey>
|
and then eqiad
|
2021-01-27 15:51:39
|
<elukey>
|
does it sound ok?
|
2021-01-27 15:51:58
|
<ottomata>
|
go for it!
|
2021-01-27 15:51:59
|
<ottomata>
|
yes!
|
2021-01-27 15:52:10
|
<ottomata>
|
(no lvs yet, right?
|
2021-01-27 15:52:11
|
<ottomata>
|
)
|
2021-01-27 15:52:49
|
<ottomata>
|
not sure if i can test very easily without, would have to do some curl --resolve magic and look up lots of stuff, but if the kube logs look good we can assume it works
|
2021-01-27 15:52:58
|
<ottomata>
|
will look at logs after you apply
|
2021-01-27 15:53:30
|
<elukey>
|
no lvs exactly
|
2021-01-27 15:54:56
|
<elukey>
|
ok we can start with
|
2021-01-27 15:54:56
|
<elukey>
|
Error: pods is forbidden: User "eventstreams-internal" cannot list resource "pods" in API group "" in the namespace "eventstreams-internal"
|
2021-01-27 15:56:56
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform, ''EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (''elukey) ` Error: pods is forbidden: User "eventstreams-internal" cannot l...'
|
2021-01-27 16:00:39
|
<wikibugs>
|
'Analytics, ''Performance-Team: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (''Gilles) ''Open→''Resolved a:''Gilles Restarting coal fixed the data, as expected: {F34044291}'
|
2021-01-27 16:01:02
|
<elukey>
|
ah I may know why
|
2021-01-27 16:01:26
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform, ''EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (''JMeybohm) You probably have not yet depoyed the admin part (the new names...'
|
2021-01-27 16:04:19
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform, ''EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (''elukey) >>! In T269160#6780761, @JMeybohm wrote: > You probably have not...'
|
2021-01-27 16:06:48
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform, ''EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (''JMeybohm) Apart from you testing my attention again (kube_env admin [codf...'
|
2021-01-27 16:13:09
|
<wikibugs>
|
('PS1) ''Mforns: Make HiveToDruid return exit code when deployMode=client [analytics/refinery/source] - ''https://gerrit.wikimedia.org/r/659017 (https://phabricator.wikimedia.org/T271568)'
|
2021-01-27 16:14:23
|
<ottomata>
|
elukey: should we start referring to presto as trino?
|
2021-01-27 16:14:34
|
<ottomata>
|
was thinking about adding presto support to wmfdata python
|
2021-01-27 16:14:40
|
<ottomata>
|
looked for a client
|
2021-01-27 16:14:41
|
<ottomata>
|
https://github.com/trinodb/trino-python-client
|
2021-01-27 16:14:44
|
<ottomata>
|
looks like the one maybe
|
2021-01-27 16:15:19
|
<elukey>
|
ottomata: to avoid too much work, I'd just upgrade to the latest presto (fb presto) and then think about migrating to trino later
|
2021-01-27 16:15:31
|
<elukey>
|
I thought we agreed on this during a standup :D
|
2021-01-27 16:19:06
|
<wikibugs>
|
'Analytics, ''Patch-For-Review: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues - https://phabricator.wikimedia.org/T271568 (''mforns) After some tests, I think the problem lies in the code: ` if (spark.conf.get("spark.master") != "yarn") { sys.exit(if (su...'
|
2021-01-27 16:19:45
|
<wikibugs>
|
'Analytics, ''Analytics-EventLogging, ''Analytics-Kanban, ''Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (''Ottomata)'
|
2021-01-27 16:19:50
|
<ottomata>
|
elukey: my memory is poor
|
2021-01-27 16:20:09
|
<ottomata>
|
if I did presto in wmfdata then, should I use https://github.com/prestodb/presto-python-client instead?
|
2021-01-27 16:20:30
|
<ottomata>
|
trino one has more recent commits
|
2021-01-27 16:20:52
|
<elukey>
|
yes that client should be ok in my opinion
|
2021-01-27 16:21:19
|
<elukey>
|
to clarify - if we want to move to trino I am 100% onboard, it just seemed like too much work for us
|
2021-01-27 16:21:38
|
<elukey>
|
but if you want to move to trino +1
|
2021-01-27 16:23:04
|
<ottomata>
|
elukey: naw i'm not trying to expedite move to it
|
2021-01-27 16:23:12
|
<ottomata>
|
just wondering what our language should be, but
|
2021-01-27 16:23:21
|
<ottomata>
|
it sounds like for my q: we should keep saying 'presto'
|
2021-01-27 16:23:27
|
<ottomata>
|
i can use a trino client now
|
2021-01-27 16:23:38
|
<ottomata>
|
and later when we switch, rename it to 'trino' in wmfdata
|
2021-01-27 16:23:40
|
<ottomata>
|
e.g. ^
|
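For reference, a hedged sketch of what wmfdata-style Presto access through the trino-python-client could look like; the host, port, catalog, and table names here are illustrative assumptions, not the actual cluster settings:

```python
from trino.dbapi import connect

# Illustrative connection settings; the real coordinator host/port and
# catalog names for the Analytics cluster are not given in this log.
conn = connect(
    host="presto-coordinator.example.org",
    port=8080,
    user="analytics-user",
    catalog="hive",
    schema="wmf",
)

cur = conn.cursor()
# At the time, the trino client still spoke the protocol Presto used,
# so it can be used now and simply relabeled "trino" in wmfdata later.
cur.execute(
    "SELECT uri_host, COUNT(*) AS requests "
    "FROM webrequest "
    "WHERE year = 2021 AND month = 1 AND day = 27 "
    "GROUP BY uri_host LIMIT 10"
)
print(cur.fetchall())
```

|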
2021-01-27 16:32:48
|
<joal>
|
Here I am
|
2021-01-27 16:33:18
|
<joal>
|
elukey: Hi :)
|
2021-01-27 16:33:29
|
<joal>
|
elukey: I'm sorry again about the network mess :(
|
2021-01-27 16:36:00
|
<joal>
|
razzi: Hello :) would you have a minute for me?
|
2021-01-27 16:47:52
|
<elukey>
|
joal: not your fault :)
|
2021-01-27 16:48:00
|
<joal>
|
I wondered :S
|
2021-01-27 16:48:06
|
<joal>
|
:)
|
2021-01-27 16:48:10
|
<elukey>
|
no I mean it was the data copy
|
2021-01-27 16:48:14
|
<joal>
|
I probably shouldn't be back :)
|
2021-01-27 16:48:19
|
<elukey>
|
but you didn't really do it on purpose
|
2021-01-27 16:48:20
|
<joal>
|
It was elukey
|
2021-01-27 16:48:23
|
<elukey>
|
so not your fault :)
|
2021-01-27 16:48:25
|
<joal>
|
well, I did!
|
2021-01-27 16:48:35
|
<elukey>
|
uffff
|
2021-01-27 16:48:47
|
<elukey>
|
I strongly disagree :D
|
2021-01-27 16:49:00
|
<joal>
|
We knew it would put load on the network - We just didn't know how much and how much was too much :)
|
2021-01-27 16:49:01
|
<elukey>
|
but I cannot really convince you otherwise :D
|
2021-01-27 16:49:04
|
<joal>
|
hehehe :)
|
2021-01-27 16:49:27
|
<joal>
|
anyway - Shall I try with half the number of mappers?
|
2021-01-27 16:50:47
|
<joal>
|
elukey: --^
|
2021-01-27 16:51:38
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+1] "Completely ignorant about this but the option looks present for 2.4 and it makes sense to me, thanks Marcel!" [analytics/refinery/source] - ''https://gerrit.wikimedia.org/r/659017 (https://phabricator.wikimedia.org/T271568) (owner: ''Mforns)'
|
2021-01-27 16:52:14
|
<mforns>
|
thanks for the CR elukey :]
|
2021-01-27 16:53:38
|
<elukey>
|
joal: yes let's try!
|
2021-01-27 16:53:40
|
<wikibugs>
|
('CR) ''Joal: [C: ''+1] "LGTM!Thanks @mforns" [analytics/refinery/source] - ''https://gerrit.wikimedia.org/r/659017 (https://phabricator.wikimedia.org/T271568) (owner: ''Mforns)'
|
2021-01-27 16:53:46
|
<joal>
|
ack elukey - launching the thing
|
2021-01-27 16:53:51
|
<elukey>
|
joal: is there a way to throttle it a bit too?
|
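DistCp itself has knobs for this: `-m` caps the number of concurrent mappers and `-bandwidth` caps each mapper's throughput in MB/s (both standard DistCp options). A sketch of a gentler invocation follows; the mapper count, bandwidth value, and destination URI are illustrative:

```python
import subprocess

# Throttled re-run of the backup copy: 32 mappers at 40 MB/s each
# bounds the aggregate around 1.25 GB/s. The destination URI is a
# placeholder; the source path is the one from joal's !log above.
subprocess.run(
    [
        "hadoop", "distcp",
        "-update",           # only copy files that are missing or differ
        "-m", "32",          # fewer concurrent map tasks
        "-bandwidth", "40",  # per-map bandwidth cap, in MB/s
        "hdfs://analytics-hadoop/wmf/data/event",
        "hdfs://backup-cluster/wmf/data/event",
    ],
    check=True,
)
```

|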
2021-01-27 16:54:17
|
<mforns>
|
thx for CR joal, do you know why we are not returning an exit code inside YARN?
|
2021-01-27 16:55:28
|
<joal>
|
mforns: I imagine we could, but there would be no way to actually take advantage of it I think
|
2021-01-27 16:56:26
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform, ''EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (''elukey) es-internal deployed in both eqiad and codfw, next steps are: -...'
|
2021-01-27 16:56:37
|
<mforns>
|
joal: aha
|
2021-01-27 16:58:30
|
<wikibugs>
|
'Analytics, ''Analytics-EventLogging, ''Analytics-Kanban, ''Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (''mforns)'
|
2021-01-27 16:58:40
|
<wikibugs>
|
'Analytics: Filter out webrequest where debug=1 from pageview - https://phabricator.wikimedia.org/T273083 (''JAllemandou)'
|
2021-01-27 17:00:11
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform, ''EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (''Ottomata) @elukey [[ https://logstash.wikimedia.org/goto/b408da9f4b39f66a...'
|
2021-01-27 17:01:58
|
<ottomata>
|
fdans: milimetric joal yoohoo!
|
2021-01-27 17:02:04
|
<joal>
|
elukey: file-listing done, actual copy starting
|
2021-01-27 17:02:12
|
<joal>
|
ottomata: tuning-session!
|
2021-01-27 17:02:23
|
<ottomata>
|
oh ho ok
|
2021-01-27 17:02:53
|
<joal>
|
elukey: 8.8M files to be copied
|
2021-01-27 17:03:33
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''Legoktm) p:''Triage→''Low'
|
2021-01-27 17:04:31
|
<joal>
|
elukey: I also have a question when you have a minute
|
2021-01-27 17:05:43
|
<mforns>
|
joal: ping standup?
|
2021-01-27 17:05:58
|
<joal>
|
mforns: tuning session? shall I maybe not be there?
|
2021-01-27 17:06:06
|
<joal>
|
fdans: --^ ?
|
2021-01-27 17:06:12
|
<mforns>
|
oh!
|
2021-01-27 17:10:29
|
<wikibugs>
|
'Analytics, ''SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (''akosiaris) I 'll take your word for it. +1 on the cleanup thing.'
|
2021-01-27 17:13:40
|
<elukey>
|
Amir1: the client_port flag is now in new webrequest data, so if you need to check/use it you can :)
|
2021-01-27 17:13:50
|
<elukey>
|
what is the ideal use case? Query via Superset?
|
2021-01-27 17:13:55
|
<elukey>
|
or do you use hive via cli?
|
2021-01-27 17:13:59
|
<elukey>
|
(or even presto)
|
2021-01-27 17:14:00
|
<Amir1>
|
Awesome
|
2021-01-27 17:14:05
|
<Amir1>
|
I do hive
|
2021-01-27 17:14:07
|
<Amir1>
|
beeline
|
2021-01-27 17:14:19
|
<elukey>
|
perfect
|
2021-01-27 17:14:23
|
<joal>
|
Amir1: I suggest you try spark ;)
|
2021-01-27 17:14:29
|
<Amir1>
|
I need to ask the cu in ukwiki
|
2021-01-27 17:14:46
|
<Amir1>
|
usually yes but this one is a specific problem :D
|
2021-01-27 17:15:03
|
<Amir1>
|
Thank you!
|
2021-01-27 17:15:28
|
<wikibugs>
|
'Analytics, ''Analytics-EventLogging, ''Analytics-Kanban, ''Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (''Ottomata) a:''Gilles→''Ottomata'
|
2021-01-27 17:18:15
|
<joal>
|
elukey: I just restarted the copy job - I realized I messed up and had not changed the number of mappers :(
|
2021-01-27 17:19:12
|
<Amir1>
|
https://phabricator.wikimedia.org/T265692#6781099 let the CU know
|
2021-01-27 17:27:08
|
<wikibugs>
|
'Analytics, ''SRE: archiva artifact links point to 127.0.0.1 - https://phabricator.wikimedia.org/T164993 (''elukey)'
|
2021-01-27 17:34:50
|
<joal>
|
razzi: not sure if you got my previous ping with the irc issues - trying again
|
2021-01-27 17:35:37
|
<razzi>
|
joal: didn't see the ping, please go again :)
|
2021-01-27 17:35:43
|
<joal>
|
Hi razzi :)
|
2021-01-27 17:35:49
|
<joal>
|
I have a question
|
2021-01-27 17:35:53
|
<joal>
|
if you have a minute
|
2021-01-27 17:36:07
|
<razzi>
|
indeed I do
|
2021-01-27 17:36:28
|
<joal>
|
razzi: Can you confirm that user eyener is in analytics-privatedata-users group?
|
2021-01-27 17:36:50
|
<joal>
|
I think elukey told me 10 times how to do it, and I still can't recall :(
|
2021-01-27 17:37:59
|
<razzi>
|
joal: I can confirm that user is in analytics-privatedata-users by running `groups eyener`
|
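The same check from Python's standard library, in case shelling out is inconvenient (a small sketch; note it only sees supplementary group members, not users whose primary group matches):

```python
import grp

def in_group(user: str, group: str) -> bool:
    # gr_mem lists the supplementary members of `group` on this host;
    # a user whose *primary* group is `group` would not appear here.
    return user in grp.getgrnam(group).gr_mem

print(in_group("eyener", "analytics-privatedata-users"))
```

|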
2021-01-27 17:38:54
|
<joal>
|
ack razzi - I wouldn't have expected I could run the groups command as non-root - Thanks a lot!!
|
2021-01-27 17:39:19
|
<razzi>
|
you're welcome :)
|
2021-01-27 17:56:23
|
<joal>
|
eyener: Hi! I'm reading your comment on the presto error ticket
|
2021-01-27 18:03:03
|
<elukey>
|
razzi: do you want to reboot an-launcher1002?
|
2021-01-27 18:03:36
|
<razzi>
|
elukey: yeah, bc?
|
2021-01-27 18:03:58
|
<razzi>
|
elukey: or maybe it's not that involved and we can do so async
|
2021-01-27 18:04:09
|
<elukey>
|
razzi: I think that we can do it in here if you are ok
|
2021-01-27 18:05:57
|
<elukey>
|
razzi: to recap - first thing is to check what's running with 'systemctl list-timers'
|
2021-01-27 18:06:33
|
<elukey>
|
we have to identify the prefixes to stop
|
2021-01-27 18:06:44
|
<elukey>
|
ah also, let's disable puppet
|
2021-01-27 18:06:58
|
<elukey>
|
with something like "Razzi - prepping for reboot"
|
2021-01-27 18:07:06
|
<razzi>
|
elukey: sounds good
|
2021-01-27 18:07:13
|
<elukey>
|
one first example could be
|
2021-01-27 18:07:27
|
<elukey>
|
sudo systemctl stop 'reportupdater-*.timer'
|
2021-01-27 18:07:41
|
<elukey>
|
the important bit here is the .timer at the end
|
2021-01-27 18:07:59
|
<elukey>
|
since if you do stop reportupdater-* you'll target the services, which might be running
|
2021-01-27 18:08:06
|
<elukey>
|
we want to stop scheduled executions
|
2021-01-27 18:08:16
|
<elukey>
|
(and basically gently draining)
|
2021-01-27 18:08:41
|
<elukey>
|
eventually you'll end up with systemctl list-timers showing only system level timers
|
2021-01-27 18:08:44
|
<elukey>
|
like logrotate etc..
|
2021-01-27 18:08:47
|
<elukey>
|
that are fine to run
|
2021-01-27 18:09:04
|
<elukey>
|
once done, we'll need to check if any java/python processes are running
|
2021-01-27 18:09:20
|
<elukey>
|
if yes, let's wait until they finish, otherwise green light to reboot
|
2021-01-27 18:09:33
|
<elukey>
|
then puppet enable + run and the maintenance is done :)
|
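The drain procedure elukey walks through could be scripted roughly as below. This is only a sketch: the timer globs are the ones named in this conversation, and a real run would still need the manual check for in-flight java/python processes before rebooting.

```python
import subprocess

# Timer units named in this conversation; stopping the .timer (not the
# .service) pauses future scheduled runs while any execution already
# in flight is left to finish on its own.
TIMER_GLOBS = [
    "reportupdater-*.timer",
    "hdfs-cleaner-*.timer",
    "mediawiki-*.timer",
    "hdfs-balancer*.timer",
]

def drain_for_reboot():
    # Disable puppet first so it does not restore the timers mid-drain.
    subprocess.run(
        ["sudo", "puppet", "agent", "--disable", "Razzi - prepping for reboot"],
        check=True,
    )
    for glob in TIMER_GLOBS:
        subprocess.run(["sudo", "systemctl", "stop", glob], check=True)
    # After this, only system-level timers (logrotate, apt-daily, ...)
    # should show up in the schedule.
    subprocess.run(["systemctl", "list-timers"], check=True)

drain_for_reboot()
```

|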
2021-01-27 18:09:35
|
<razzi>
|
I don't see reportupdater- timers in systemctl list-timers
|
2021-01-27 18:09:49
|
<elukey>
|
Wed 2021-01-27 19:00:00 UTC 53min left Wed 2021-01-27 18:00:00 UTC 6min ago reportupdater-browser.timer
|
2021-01-27 18:10:03
|
<elukey>
|
on what host are you?
|
2021-01-27 18:10:09
|
<razzi>
|
:) an-master oops
|
2021-01-27 18:10:18
|
<elukey>
|
ah yes it makes sense then :D
|
2021-01-27 18:10:35
|
<razzi>
|
Reenabled puppet, now on to an-launcher1002
|
2021-01-27 18:10:51
|
<wikibugs>
|
'Analytics-Radar: Presto error in Superset - only when grouping - https://phabricator.wikimedia.org/T270503 (''JAllemandou) Hi @EYener > presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event_sanitized/CentralNoticeBannerHistory/year=2021/month=1/day=9/hour=21 I have not experienced th...'
|
2021-01-27 18:12:42
|
<joal>
|
elukey: are we still ok in term of network?
|
2021-01-27 18:14:36
|
<elukey>
|
joal: it seems so yes, no complaints about link saturation
|
2021-01-27 18:14:46
|
<joal>
|
ack elukey - thanks for checking
|
2021-01-27 18:15:02
|
<joal>
|
elukey: something else, if I may?
|
2021-01-27 18:15:58
|
<elukey>
|
joal: sure what's up
|
2021-01-27 18:16:44
|
<joal>
|
elukey: we're gonna need all users set up on the backup cluster :(
|
2021-01-27 18:17:16
|
<joal>
|
elukey: the /user folder is looking wrong despite me having resynced
|
2021-01-27 18:17:24
|
<joal>
|
in terms of ownership
|
2021-01-27 18:17:36
|
<elukey>
|
joal: you wiped and re-copied right?
|
2021-01-27 18:17:52
|
<joal>
|
elukey: I ran distcp -update - which should do the same
|
2021-01-27 18:18:53
|
<eyener>
|
joal Awesome! You fixed it! :) I am not sure what the issue was but every chart in that dash was failing to load yesterday
|
2021-01-27 18:19:20
|
<joal>
|
eyener: eh :) Fixing without touching is my preferred way - usually doesn't work though :)
|
2021-01-27 18:20:02
|
<joal>
|
thanks for letting me know eyener - sorry for no good answer on updating charts (yet)
|
2021-01-27 18:21:20
|
<eyener>
|
Ha no worries joal - appreciate you checking it out. I've asked in the Superset slack workspace as well and haven't received a reply but I'll let you know if I ever figure it out
|
2021-01-27 18:21:29
|
<eyener>
|
maybe some jinja templating or something...?
|
2021-01-27 18:22:11
|
<joal>
|
very possible eyener - /me is no superset ninja for sure
|
2021-01-27 18:22:13
|
<elukey>
|
joal: not sure, have you tried to explicitly wipe and copy a single user dir? Just to see if perms are weird
|
2021-01-27 18:22:28
|
<elukey>
|
in theory users are already deployed on the cluster, on all nodes
|
2021-01-27 18:22:31
|
<elukey>
|
masters + workers
|
2021-01-27 18:22:35
|
<joal>
|
MQH
|
2021-01-27 18:22:37
|
<joal>
|
MEH
|
2021-01-27 18:22:50
|
<wikibugs>
|
'Analytics, ''Product-Infrastructure-Data, ''Wikimedia-Logstash, ''observability, ''Patch-For-Review: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (''Ottomata) In a meeting with devs doing client error logging today, we realized that conf...'
|
2021-01-27 18:23:06
|
<joal>
|
elukey: I'll try wipe-out for real and see if it changes anything
|
2021-01-27 18:23:29
|
<joal>
|
elukey: and I'll use 64 mappers as my basis
|
2021-01-27 18:24:01
|
<elukey>
|
perfect thanks
|
2021-01-27 18:24:11
|
<elukey>
|
if it doesn't work we can check again but it is weird
|
2021-01-27 18:24:16
|
<joal>
|
sure elukey
|
2021-01-27 18:24:33
|
<joal>
|
thanks for confirming that the hardware should be ready
|
2021-01-27 18:25:15
|
<mforns>
|
joal, I believe the changes you did to hdfs cleaner need to be deployed?
|
2021-01-27 18:25:40
|
<joal>
|
mforns: I think elukey did?
|
2021-01-27 18:25:47
|
<joal>
|
maybe not?
|
2021-01-27 18:26:01
|
<mforns>
|
joal: isn't the hdfs cleaner in refinery repo?
|
2021-01-27 18:26:20
|
<elukey>
|
mforns: yep the three timers have been deployed
|
2021-01-27 18:26:34
|
<mforns>
|
ok elukey thanks
|
2021-01-27 18:26:38
|
<joal>
|
mforns: I have not changed the code - only added puppet stuff :)
|
2021-01-27 18:26:46
|
<wikibugs>
|
'Analytics-Clusters, ''DC-Ops, ''SRE, ''ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (''wiki_willy) Hi @elukey - thanks for the mapping. What makes it tough is that the remaining 6x hosts need to be on 10g switches, which really limits our op...'
|
2021-01-27 18:26:46
|
<mforns>
|
ok ok :]
|
2021-01-27 18:26:54
|
<joal>
|
thanks for checking mforns
|
2021-01-27 18:31:52
|
<elukey>
|
mforns: in theory I should be on-call now right? Anything to hand over?
|
2021-01-27 18:31:59
|
<elukey>
|
forgot to ask during standup
|
2021-01-27 18:32:06
|
<elukey>
|
razzi: how are things going?
|
2021-01-27 18:32:15
|
<mforns>
|
elukey: no no, it's tomorrow
|
2021-01-27 18:32:33
|
<razzi>
|
elukey: good, have stopped some more timers, still going through the list
|
2021-01-27 18:32:45
|
<elukey>
|
okok
|
2021-01-27 18:36:47
|
<razzi>
|
I believe the following services should be kept, am I missing any?
|
2021-01-27 18:36:47
|
<razzi>
|
export_smart_data_dump.service
|
2021-01-27 18:36:47
|
<razzi>
|
logrotate.service
|
2021-01-27 18:36:47
|
<razzi>
|
man-db.service
|
2021-01-27 18:36:47
|
<razzi>
|
systemd-tmpfiles-clean.service
|
2021-01-27 18:38:19
|
<razzi>
|
oh and apt-daily.service and apt-daily-upgrade.service
|
2021-01-27 18:38:30
|
<elukey>
|
yes yes
|
2021-01-27 18:38:45
|
<elukey>
|
the only one that you missed is the hdfs-cleaner-*
|
2021-01-27 18:38:59
|
<elukey>
|
those are the periodic jobs that clean up some dirs in hdfs
|
2021-01-27 18:40:29
|
<elukey>
|
razzi: --^
|
2021-01-27 18:41:36
|
<razzi>
|
cool
|
2021-01-27 18:45:34
|
<elukey>
|
razzi: can you stop them?
|
2021-01-27 18:45:45
|
<elukey>
|
so we can proceed with the next steps :)
|
2021-01-27 18:46:08
|
<razzi>
|
yes yes, got distracted
|
2021-01-27 18:47:51
|
<elukey>
|
razzi: also there are mediawiki-* and hdfs-balancer
|
2021-01-27 18:48:30
|
<elukey>
|
we should really think about changing the names, adding something like analytics- in front
|
2021-01-27 18:48:34
|
<razzi>
|
How about prometheus-nic-firmware-textfile / prometheus_intel_microcode?
|
2021-01-27 18:49:17
|
<elukey>
|
those are fine, the prometheus exporters can be left aside
|
2021-01-27 18:49:23
|
<elukey>
|
they just expose metrics
|
2021-01-27 18:49:30
|
<razzi>
|
ok cool
|
2021-01-27 18:50:00
|
<elukey>
|
then we need to make sure that no java/python processes are running, and if so we'd need to wait
|
2021-01-27 18:50:33
|
<razzi>
|
so wait should hdfs-balancer and mediawiki* be stopped?
|
2021-01-27 18:54:05
|
<elukey>
|
yep yep
|
2021-01-27 18:54:16
|
<elukey>
|
those don't need to run while we reboot
|
2021-01-27 18:55:50
|
<razzi>
|
ok should be all set to reboot
|
2021-01-27 18:57:10
|
<elukey>
|
razzi: what about java/python processes running?
|
2021-01-27 18:57:16
|
<razzi>
|
oh right
|
2021-01-27 18:57:38
|
<elukey>
|
also you didn't stop the hdfs-cleaner timers
|
2021-01-27 18:58:21
|
<mforns>
|
ottomata: not sure if I need a +1 for these, but just in case, can you look? :]
|
2021-01-27 18:58:23
|
<mforns>
|
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/659022
|
2021-01-27 18:58:28
|
<mforns>
|
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/658426
|
2021-01-27 19:00:15
|
<razzi>
|
elukey: alright, stopped
|
2021-01-27 19:01:59
|
<fkaelin>
|
We would like to run a spark job that downloads all commons images from swift and stores the base64 image bytes in a column on hdfs; there will be roughly 7TB of data. Is there a recommended folder to store such a dataset, ie so that the size will not cause problems and it is available for others on the team?
|
2021-01-27 19:04:28
|
<elukey>
|
fkaelin: I suspect that you are working with Miriam :D
|
2021-01-27 19:05:34
|
<elukey>
|
fkaelin: so there are a couple of things to check - how many files are we talking about? (the hdfs namenode suffers a bit when we add millions more files etc..)
|
2021-01-27 19:05:45
|
<elukey>
|
I am more concerned about that than those 7TB of space
|
2021-01-27 19:06:04
|
<elukey>
|
razzi: so next steps? :)
|
2021-01-27 19:06:31
|
<razzi>
|
I see a couple of python processes: eventlogging_to_druid_navigationtiming_hourly and eventlogging_to_druid_navigationtiming_daily
|
2021-01-27 19:06:31
|
<razzi>
|
and a couple java ones org.wikimedia.analytics.refinery.job.HiveToDruid
|
2021-01-27 19:06:31
|
<razzi>
|
No idea how long they'll take to finish
|
2021-01-27 19:06:47
|
<elukey>
|
razzi: perfect
|
2021-01-27 19:06:54
|
<elukey>
|
one thing to check is when they started
|
2021-01-27 19:07:02
|
<elukey>
|
one on Jan25
|
2021-01-27 19:07:13
|
<elukey>
|
the other on Jan21
|
2021-01-27 19:07:30
|
<elukey>
|
or not sorry lemme check better
|
2021-01-27 19:07:34
|
<elukey>
|
I might say something silly
|
2021-01-27 19:08:20
|
<elukey>
|
mmm yes weird they have been running for a while
|
2021-01-27 19:08:50
|
<elukey>
|
mforns: holaaaaa
|
2021-01-27 19:08:55
|
<elukey>
|
do you have a min?
|
2021-01-27 19:09:36
|
<elukey>
|
the navtiming hourly + daily hive2druid indexations seem to be taking a long time, they started hours and hours ago
|
2021-01-27 19:09:45
|
<elukey>
|
has it ever happened that they got stuck?
|
2021-01-27 19:10:17
|
<fkaelin>
|
elukey yes, that is work with miriam. the image bytes will be stored as base64 encoded strings in a schema, so the number of files depends on whatever blocksize hadoop/spark chooses
|
2021-01-27 19:11:39
|
<elukey>
|
fkaelin: okok so 7TB is a bit much but we have a lot of space, and it is a one off, the only thing that we should check is how many files will be generated.. if it is say 10 million it might be a problem, if we are talking about a few thousand I think it is fine
|
2021-01-27 19:12:20
|
<elukey>
|
fkaelin: can we run a test on a subset of data to see how many files are generated?
|
2021-01-27 19:13:26
|
<elukey>
|
our blocksize for hadoop is 256M IIRC
|
2021-01-27 19:14:06
|
<elukey>
|
razzi: since we cannot leave things stopped for so long, let's reboot an-launcher1002
|
2021-01-27 19:14:15
|
<elukey>
|
those two jobs seem stuck
|
2021-01-27 19:14:25
|
<elukey>
|
(we need to downtime first)
|
2021-01-27 19:19:50
|
<elukey>
|
razzi: I am rebooting the node myself, we should not wait this long
|
2021-01-27 19:20:13
|
<elukey>
|
we stopped camus for a long time and when it restarts it lags for a while
|
2021-01-27 19:20:26
|
<elukey>
|
so when doing maintenance let's focus on the task please :)
|
2021-01-27 19:21:37
|
<razzi>
|
elukey: alright yeah
|
2021-01-27 19:24:49
|
<mforns>
|
elukey: in meeting! it finishes in 25mins
|
2021-01-27 19:25:03
|
<elukey>
|
mforns: all good! We can follow up tomorrow
|
2021-01-27 19:25:27
|
<mforns>
|
elukey: but yes, it happened start of the year!
|
2021-01-27 19:25:42
|
<elukey>
|
sigh :(
|
2021-01-27 19:25:55
|
<elukey>
|
razzi: ok host is up, can you re-enable and run puppet?
|
2021-01-27 19:27:16
|
<razzi>
|
elukey: re-enable timers via systemctl start?
|
2021-01-27 19:27:35
|
<elukey>
|
razzi: a puppet run is sufficient to restore all puppet-defined timers
|
2021-01-27 19:27:50
|
<razzi>
|
gotcha, that makes sense
|
2021-01-27 19:27:56
|
<ottomata>
|
sorry mforns looks like you got em +ed :)
|
2021-01-27 19:28:12
|
<mforns>
|
ottomata: yes, no problemo, they deployed :]
|
2021-01-27 19:29:04
|
<ottomata>
|
razzi: hm i did migrate a bunch of navigationtiming data to event platform today!
|
2021-01-27 19:29:11
|
<ottomata>
|
i wouldn't expect it to cause issues
|
2021-01-27 19:29:13
|
<ottomata>
|
but..would it?
|
2021-01-27 19:29:32
|
<ottomata>
|
mforns: can you think of anything in hive to druid that would need to be changed to deal with events with migrated schemas?
|
2021-01-27 19:29:37
|
<ottomata>
|
the hive table was migrated yesterday
|
2021-01-27 19:29:39
|
<elukey>
|
one job was stuck since the 21st :(
|
2021-01-27 19:29:46
|
<elukey>
|
the other from the 25th
|
2021-01-27 19:29:55
|
<mforns>
|
ottomata: in a meeting, but will respond in a bit!
|
2021-01-27 19:30:13
|
<ottomata>
|
hm yeah i didn't touch navigation timing until yesterday
|
2021-01-27 19:30:15
|
<ottomata>
|
also meeting! :)
|
2021-01-27 19:30:24
|
<elukey>
|
all right I am going to dinner, ttl!
|
2021-01-27 19:30:29
|
<razzi>
|
cya elukey
|
2021-01-27 19:31:07
|
<ottomata>
|
l8rs
|
2021-01-27 19:31:23
|
<wikibugs>
|
'Analytics-Kanban, ''Better Use Of Data, ''Product-Analytics, ''Product-Infrastructure-Data: Roll-up raw sessionTick data into distribution - https://phabricator.wikimedia.org/T271455 (''sdkim) a:''mforns→''Mayakp.wiki'
|
2021-01-27 19:37:22
|
<razzi>
|
afk for lunch
|
2021-01-27 20:02:37
|
<joal>
|
gone for tonight team - see you tomorrow
|
2021-01-27 20:09:03
|
<fkaelin>
|
elukey for the tests I used the default blocksize which seems to be 64MB. So for 7TB of data we are looking at ~100k files, or ~25k if we set the blocksize to 256MB.
|
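Spelled out, the arithmetic behind those estimates (binary units, and roughly one output file per block, as assumed above):

```python
# 7 TB of image data split into HDFS block-sized output files.
total_mb = 7 * 1024 * 1024      # 7 TiB expressed in MiB

files_64mb = total_mb // 64     # 114,688 -> the "~100k files" figure
files_256mb = total_mb // 256   # 28,672  -> the "~25k" figure

print(files_64mb, files_256mb)
```

|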
2021-01-27 20:12:26
|
<fkaelin>
|
elukey the job will run over a couple days on a small number of workers (aiming for ~100qps to swift), so the hdfs files will be created at a slow pace.
|
2021-01-27 20:27:16
|
<wikibugs>
|
'Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (''EBernhardson) Concur with regard to multi-tenancy, I tried to setup our airflow initially in a way that used the builtin multi-tenancy but as soon as I started inte...'
|
2021-01-27 20:50:34
|
<mforns>
|
ottomata: I don't see any thing that would need to be changed for HiveToDruid re. migrated schemas...
|
2021-01-27 20:50:58
|
<mforns>
|
ottomata: maybe the only thing would be the meaning of dt field
|
2021-01-27 20:51:22
|
<mforns>
|
but IIUC the meaning of dt does not change right?
|
2021-01-27 20:52:28
|
<mforns>
|
and all other fields are available with the same name in a backwards compatible way... so, I'd say no changes needed
|
2021-01-27 20:54:26
|
<eyener>
|
joal if you're around, I'm getting another iteration of the `presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event_sanitized/CentralNoticeBannerHistory/year=2021/month=1/day=9/hour=1` error when I try to edit the Banner History dash
|
2021-01-27 20:57:45
|
<wikibugs>
|
('PS1) ''Mforns: Add en.wikidata to pageview whitelist [analytics/refinery] - ''https://gerrit.wikimedia.org/r/659081'
|
2021-01-27 20:59:58
|
<ottomata>
|
mforns: no it's the same with legacy data
|
2021-01-27 21:00:07
|
<ottomata>
|
dt only means event time for new schemas
|
2021-01-27 21:02:49
|
<mforns>
|
I see ottomata, HiveToDruid will work for new schemas the same, the only difference (if we want to use a time field other than dt for a given dataset) would be we have to explicitly specify it from druid_load.pp (which is already supported)
|
2021-01-27 21:03:03
|
<ottomata>
|
cool!
|
2021-01-27 21:08:05
|
<wikibugs>
|
'Analytics, ''Analytics-EventLogging, ''Analytics-Kanban, ''Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (''mforns)'
|
2021-01-27 21:08:49
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 (''mforns)'
|
2021-01-27 21:09:00
|
<wikibugs>
|
'Analytics, ''Analytics-Kanban, ''Event-Platform, ''Patch-For-Review: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 (''mforns)'
|
2021-01-27 21:15:58
|
<wikibugs>
|
'Analytics, ''Better Use Of Data: Create Oozie job for session length - https://phabricator.wikimedia.org/T273116 (''mforns)'
|
2021-01-27 21:33:53
|
<ottomata>
|
fkaelin: default block size should be 256MB
|
2021-01-27 21:33:55
|
<ottomata>
|
https://yarn.wikimedia.org/conf
|
2021-01-27 21:33:59
|
<ottomata>
|
dfs.blocksize
|
2021-01-27 21:38:43
|
<ottomata>
|
mforns: tomorrow my morning i'm going to migrate my nav timing schemas to all wikis, if you are around we can do yours at the same time (without a deployment window)
|
2021-01-29 23:22:06
|
<razzi>
|
an-test-presto1001 is out of disk space and is causing alarms, but since it's a test node I'm not going to bother with it for now
|
2021-01-29 23:25:04
|
<wikibugs>
|
'Analytics: Presto should warn or prevent users from querying without Hive partition predicates - https://phabricator.wikimedia.org/T273004 (''razzi) One way to go about this may be to use `hive.max-partitions-per-scan`. From the docs: | hive.max-partitions-per-scan | Maximum number of partitions for a single...'
|
2021-01-29 23:27:02
|
<razzi>
|
There is also a problem on kafka-test1009: after rebooting, I see
|
2021-01-29 23:27:02
|
<razzi>
|
```
|
2021-01-29 23:27:02
|
<razzi>
|
razzi@kafka-test1009:~$ sudo systemctl list-units --failed
|
2021-01-29 23:27:02
|
<razzi>
|
UNIT LOAD ACTIVE SUB DESCRIPTION
|
2021-01-29 23:27:02
|
<razzi>
|
● ifup@ens5.service loaded failed failed ifup for ens5
|
2021-01-29 23:27:03
|
<razzi>
|
```
|
2021-01-29 23:27:03
|
<razzi>
|
Again, since it's a test node, I'm going to leave it alone
|
2021-01-29 23:41:56
|
<wikibugs>
|
'Analytics, ''Product-Data-Infrastructure, ''Wikimedia-Logstash, ''observability, ''Patch-For-Review: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (''colewhite) >>! In T265938#6781389, @Ottomata wrote: > In a meeting with devs doing clien...'
|