[02:11:55] 10Analytics-Radar: Presto error in Superest - only when grouping - https://phabricator.wikimedia.org/T270503 (10EYener) Hi @JAllemandou thanks for the reply! I am pulling this task back up and opened the dashboard to implement these suggestions. However, I encountered a new error on all charts: presto error: Fa... [07:18:20] good morning [07:33:06] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Collect metrics of all wikis [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/655886 (https://phabricator.wikimedia.org/T271894) (owner: 10WMDE-Fisch) [07:33:17] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) [07:33:48] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) [07:52:40] I am checking some rack availability for the new hadoop workers, and I found that in some we have more than 5 workers [07:52:46] no bueno [07:57:17] I am trying to spread nodes evenly on rows so a rack down with say 7 nodes on top shouldn't cause a ton of issues, but it is not great either [08:01:12] ah no max seems to be 6 [08:06:21] no sigh 7 in rack C4 [08:31:28] ok completed the review, overall after the recent workers addition we have [08:31:31] 19 A [08:31:33] 19 B [08:31:33] that looks very good [08:31:36] 21 C [08:31:38] 19 D [08:31:46] so the new 6 nodes can be spread anywhere [08:31:54] will comment in the task [08:38:26] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) Hi @wiki_willy thanks a lot for following up! I re-done the calculations of the workers' distribution after the last racking and this is what I g... 
[08:38:40] added my notes to --^ [09:17:10] 10Analytics, 10Performance-Team: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (10Gilles) [09:26:14] 10Analytics, 10Performance-Team, 10Patch-For-Review: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (10Gilles) Seems like coal simply needed to be restarted, it hadn't been since python3-snappy was installed on the host a few days ago for navtiming's sake. Won't hurt... [09:27:32] 10Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (10elukey) I had some thoughts about bottlenecks and the only one that came to mind, not mentioned in the description of the task, is the database. The only an-airflow... [09:36:49] 10Analytics, 10Performance-Team, 10Patch-For-Review: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (10Gilles) p:05Triage→03High [09:44:01] 10Analytics, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10elukey) [10:22:29] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, and 3 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (10hashar) That is the MediaWiki installer failing: `counterexample * A dependency error was encount... [10:26:09] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (10hashar) [10:37:39] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). 
- https://phabricator.wikimedia.org/T272863 (10hashar) The CI config change to add EventBus to the wmf-quibble* jobs is https://gerrit.wikimedia... [11:15:56] !log add client_port and debug fields to X-Analytics in webrequest varnishkafka streams [11:15:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:21:14] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (10elukey) Both changed deployed by Valentin, I checked the client_port field in webrequest_text on Kafka and it works nicely. The debug header needs to be triggered by an... [12:23:08] * elukey lunch! [12:30:01] Same. [12:42:21] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data, and 5 others: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 (10hashar) On the dummy change https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/6589... [13:02:39] !log Copy /wmf/data/event to backup cluster (30Tb) - T272846 [13:02:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:02:42] T272846: Backup HDFS data before BigTop upgrade - https://phabricator.wikimedia.org/T272846 [13:36:14] hey a-team, good morning/afternoon/evening . I'm having issues with the pageviews API, it works correctly from the browser, but using python I get this error: https://pastebin.pl/view/7fa0efeb [13:36:32] I'm wondering if there is any user agent issue [13:37:29] hi dsaez :] looking into this [13:38:03] thx mforns [13:38:32] dsaez: yes it is me [13:39:23] or it should be me, let's try to see :) [13:39:33] are you using python-requests? 
[13:39:52] because we added a specific block in Varnish the other day after a big surge in traffic [13:40:05] following https://meta.wikimedia.org/wiki/User-Agent_policy [13:40:20] so the block returns a 403 in this case but it should mention the UA policy [13:40:26] that I don't see in your paste [13:40:31] what is the HTTP error code returned? [13:41:09] also, can you give us the link to check? [13:41:16] 10Analytics: Add user to analytics-privatedata-users group - https://phabricator.wikimedia.org/T273058 (10gmodena) [13:41:35] elukey: it's a 403 [13:41:43] yep 403 [13:41:56] I'm using requests [13:42:13] for example https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/de.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/daily/2015101300/2015102700 [13:42:19] that works from the browser [13:42:22] the error message mentions indeed: Scripted requests from your IP have been blocked, please see https://meta.wikimedia.org/wiki/User-Agent_policy. [13:43:39] but this: requests.get(that_url) returns the error [13:44:26] elukey, sorry, I don't get it. This is an API, so which is the expected UA? [13:45:31] dsaez: The generic format is <client name>/<version> (<contact information>) <library>/<version> [<library>/<version> ...]. Parts that are not applicable can be omitted. See: https://meta.wikimedia.org/wiki/User-Agent_policy [13:45:42] mforns: ah didn't see that yes [13:46:16] dsaez: I think you can use requests to send user agent: [13:46:20] dsaez: You'd need to provide a UA that can tell us how to contact you in case the volume of requests is big [13:46:42] response = requests.get(url, headers = {'User-agent': 'blahblah'}) [13:47:02] this is far from perfect, we may lift the block very soon (it was due to emergency) but in general we should follow the UA policy for the APIs [13:47:51] got it. Sounds very strict, I've done two calls.
[13:49:02] yes yes I know, we also have to figure out throttling, it is a temporary measure [13:49:15] but in the long term we suggest to everybody to use a proper UA [13:49:28] got it [13:49:29] in fact [13:49:33] is not blocked [13:49:57] if I add the blahblah agent, is enough [13:50:08] yes please use a better UA :D [13:50:13] hahaha [13:50:14] sure [14:02:27] 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10serviceops, and 2 others: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (10jijiki) [14:02:30] 10Analytics, 10Analytics-Kanban, 10serviceops, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10jijiki) 05Open→03Resolved @Milimetric patch is merged! We are setting debug=1 in the X-Analytics header if "X-Wikimedia-Debug" is present. Thank you fo... [14:03:50] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Add client TCP source port to webrequest - https://phabricator.wikimedia.org/T271953 (10jijiki) Debug header works, we tested it with @elukey:) [14:04:57] joal: are you around?\ [14:24:29] joal: I killed the copy (client + map-reduce job), we were causing network alarms :( [14:53:07] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) This is weird. I don't think we have encountered this before. ExecStop in the systemd unit file runs `ifdown ens5` but running that on the host returns ` root@kafka-test1006:... [14:53:39] 10Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (10Ottomata) > if we want to have two/three more Airflow instances Do we want/need this? > store a little mariadb instance on every deployment of Airflow, getting re... 
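The fix discussed above (passing a policy-compliant User-Agent to python-requests) can be sketched as follows. The client name, version, and contact address below are made-up placeholders, not values from the conversation; only the header format comes from the linked User-Agent policy.

```python
# Sketch: build a User-Agent header that follows
# https://meta.wikimedia.org/wiki/User-Agent_policy
# (client/version plus contact info, so ops can reach the operator).

def ua_headers(client, version, contact):
    """Return a headers dict identifying the client and its operator."""
    return {"User-Agent": f"{client}/{version} ({contact})"}

headers = ua_headers("my-pageviews-script", "0.1", "someone@example.org")
print(headers["User-Agent"])  # my-pageviews-script/0.1 (someone@example.org)

# With python-requests (not executed here, needs network access):
# import requests
# url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
#        "de.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/daily/"
#        "2015101300/2015102700")
# r = requests.get(url, headers=headers)  # succeeds where the bare call got a 403
```

As noted in the chat, even a minimal UA is enough to pass the block, but a real contact address is what the policy actually asks for.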
[15:00:13] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Analytics, 10Product-Infrastructure-Data: MEP: Should stream configurations be written in YAML? - https://phabricator.wikimedia.org/T269774 (10Ottomata) > Create a new repo for stream configs and add it as a git submodule to operations/mediaw... [15:14:06] 10Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (10elukey) @Ottomata the main problem that I can see how is that multi-tenancy is not really something that Airflow does well (and the people from Polidea confirmed th... [15:17:34] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) @akosiaris not reliably, but today I rebooted the 4 schema VMs and one of them got back with the same issue.. [15:24:53] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10MoritzMuehlenhoff) >>! In T273026#6780528, @akosiaris wrote: > This is weird. I don't think we have encountered this before. > > ExecStop in the systemd unit file runs `ifdown ens5` but... [15:28:31] heya elukey [15:28:37] sorry I'm with kids [15:28:41] good that you killed it [15:28:52] Let's review togother when I have time [15:30:46] ack! I just pinged if you were around, I used the hammer :D [15:38:14] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) >>! In T273026#6780640, @MoritzMuehlenhoff wrote: >>>! In T273026#6780528, @akosiaris wrote: >> This is weird. I don't think we have encountered this before. >> >> ExecStop in... 
[15:39:11] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10elukey) I recall VMs only from my past experience, I encountered this problem a couple of times before this one. [15:45:08] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10JMeybohm) >>! In T269160#6777382, @elukey wrote: > Waiting for @JMeybohm'... [15:46:24] 10Analytics, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10elukey) @Ottomata @razzi this is the first datanode disk failure after the change that I made to use facter to populate the available partitions that Yarn and HDFS can use on a given worker node. In... [15:47:22] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) >>! In T269160#6780685, @JMeybohm wrote: >>>! In T269160#6777382,... [15:47:31] 10Analytics, 10Event-Platform: Rematerialise all event schemas with enforceNumericBounds: true - https://phabricator.wikimedia.org/T273069 (10Ottomata) [15:47:37] 10Analytics, 10Event-Platform: Rematerialize all event schemas with enforceNumericBounds: true - https://phabricator.wikimedia.org/T273069 (10Ottomata) [15:49:36] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10MoritzMuehlenhoff) >>! In T273026#6780670, @akosiaris wrote: > Do you by any chance remember if it was on VMs only? Or was it physical hosts too? From my memory only VMs. I've checked my... [15:50:54] ottomata: if you are ok I'd heml eventstreams-internal! 
cd /srv/deployment-charts/helmfile.d/services/eventstreams-internal; helmfile -e codfw -i apply [15:51:33] and then eqiad [15:51:39] does it sound ok? [15:51:58] go for it! [15:51:59] yes! [15:52:10] (no lvs yet, right? [15:52:11] ) [15:52:49] not sure if i can test very easily without, would have to do some curl --resolve magic and look up lots of stuff, but if the kube logs look good we can assume it works [15:52:58] will look at logs after you apply [15:53:30] no lvs exactly [15:54:56] ok we can start with [15:54:56] Error: pods is forbidden: User "eventstreams-internal" cannot list resource "pods" in API group "" in the namespace "eventstreams-internal" [15:56:56] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) ` Error: pods is forbidden: User "eventstreams-internal" cannot l... [16:00:39] 10Analytics, 10Performance-Team: Coal graphs died around 2021-01-26 20:50 UTC - https://phabricator.wikimedia.org/T273033 (10Gilles) 05Open→03Resolved a:03Gilles Restarting coal fixed the data, as expected: {F34044291} [16:01:02] ah I may know why [16:01:26] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10JMeybohm) You probably have not yet deployed the admin part (the new names... [16:04:19] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) >>! In T269160#6780761, @JMeybohm wrote: > You probably have not...
[16:06:48] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10JMeybohm) Apart from you testing my attention again (kube_env admin [codf... [16:13:09] (03PS1) 10Mforns: Make HiveToDruid return exit code when deployMode=client [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/659017 (https://phabricator.wikimedia.org/T271568) [16:14:23] elukey: should we start referring to presto as trino? [16:14:34] was thinking about adding presto support to wmfdata python [16:14:40] looked for a client [16:14:41] https://github.com/trinodb/trino-python-client [16:14:44] looks like the one maybe [16:15:19] ottomata: to avoid too much work, I'd just upgrade to the latest presto (fb presto) and then think about migrating to trino later [16:15:31] I thought we agreed on this during a standup :D [16:19:06] 10Analytics, 10Patch-For-Review: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues - https://phabricator.wikimedia.org/T271568 (10mforns) After some tests, I think the problem lies in the code: ` if (spark.conf.get("spark.master") != "yarn") { sys.exit(if (su... [16:19:45] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (10Ottomata) [16:19:50] elukey: my memory is poor [16:20:09] if I did presto in wmfdata then, should I use https://github.com/prestodb/presto-python-client instead?
[16:20:30] trino one has more recent commits [16:20:52] yes that client should be ok in my opinion [16:21:19] to clarify - if we want to move to trino I am 100% onboard, it seemed only too much for us [16:21:38] but if you want to move to trino +1 [16:23:04] elukey: naw i'm not trying to expedite move to it [16:23:12] just wondering what our language should be, but [16:23:21] it sounds like for my q: we should keep saying 'presto' [16:23:27] i can use a trino client now [16:23:38] and later when we change rename to 'trino' in wmfdata [16:23:40] e.g. ^ [16:32:48] Here I am [16:33:18] elukey: Hi :) [16:33:29] elukey: I'm sorry again about the network mess :( [16:36:00] razzi: Hello :) would you have a minute for me? [16:47:52] joal: not your fault :) [16:48:00] I wondered :S [16:48:06] :) [16:48:10] no I mean it was the data copy [16:48:14] I probably shouldn't be back :) [16:48:19] but you didn't really do it on purpose [16:48:20] It was elukey [16:48:23] so not your fault :) [16:48:25] well, I did! [16:48:35] uffff [16:48:47] I strongly disagree :D [16:49:00] We knew it would put load on the network - We just didn't know how much and how much was too much :) [16:49:01] but I cannot really convince you otherwise :D [16:49:04] hehehe :) [16:49:27] anyway - Shall I try with half the number of mappers? [16:50:47] elukey: --^ [16:51:38] (03CR) 10Elukey: [C: 03+1] "Completely ignorant about this but the option looks present for 2.4 and it makes sense to me, thanks Marcel!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/659017 (https://phabricator.wikimedia.org/T271568) (owner: 10Mforns) [16:52:14] thanks for the CR elukey :] [16:53:38] joal: yes let's try! [16:53:40] (03CR) 10Joal: [C: 03+1] "LGTM!Thanks @mforns" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/659017 (https://phabricator.wikimedia.org/T271568) (owner: 10Mforns) [16:53:46] ack elukey - launching the thing [16:53:51] joal: is there a way to throttle it a bit too? 
[16:54:17] thx for CR joal, do you know why we are not returning exit code inside YARN? [16:55:28] mforns: I imagine we could, but there would be no way to actually take advantage of it I think [16:56:26] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) es-internal deployed in both eqiad and codfw, next steps are: -... [16:56:37] joal: aha [16:58:30] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (10mforns) [16:58:40] 10Analytics: Filter out webrequest where debug=1 from pageview - https://phabricator.wikimedia.org/T273083 (10JAllemandou) [17:00:11] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) @elukey [[ https://logstash.wikimedia.org/goto/b408da9f4b39f66a... [17:01:58] fdans: milimetric joal yoohoo! [17:02:04] elukey: file-listing done, actual copy starting [17:02:12] ottomata: tuning-session! [17:02:23] oh ho ok [17:02:53] elukey: 8.8M files to be copied [17:03:33] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10Legoktm) p:05Triage→03Low [17:04:31] elukey: I also have a question when ou have a minute [17:05:43] joal: ping standup? [17:05:58] mforns: tuning session? shall I maybe not be there? [17:06:06] fdans: --^ ? [17:06:12] oh! [17:10:29] 10Analytics, 10SRE: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10akosiaris) I 'll take your word for it. +1 on the cleanup thing. 
[17:13:40] Amir1: the client_port flag is now in new webrequest data, so if you need to check/use it you can :) [17:13:50] what is the ideal use case? Query via Superset? [17:13:55] or do you use hive via cli? [17:13:59] (or even presto) [17:14:00] Awesome [17:14:05] I do hive [17:14:07] beeline [17:14:19] perfect [17:14:23] Amir1: I suggest you try spark ;) [17:14:29] I need to ask the cu in ukwiki [17:14:46] usually yes but this one is a specific problem :D [17:15:03] Thank you! [17:15:28] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (10Ottomata) a:05Gilles→03Ottomata [17:18:15] elukey: I just restarted the copy job - I realized I messed up and had not changed the number of mappers :( [17:19:12] https://phabricator.wikimedia.org/T265692#6781099 let the CU know [17:27:08] 10Analytics, 10SRE: archiva artifact links point to 127.0.0.1 - https://phabricator.wikimedia.org/T164993 (10elukey) [17:34:50] razzi: not sure if you got my previous ping with the irc issues - trying again [17:35:37] joal: didn't see the ping, please go again :) [17:35:43] Hi razzi :) [17:35:49] I have a question [17:35:53] if you have a minute [17:36:07] indeed I do [17:36:28] razzi: Can you confirm that user eyener is in analytics-privatedata-users group? [17:36:50] I think elukey told me 10 times how to do it, and I still can't recall :( [17:37:59] joal: I can confirm that user is in analytics-privatedata-users by running `groups eyener` [17:38:54] ack razzi - I wouldn't have expected I can run the groups command as not-root - Thanks a lot!! [17:39:19] you're welcome :) [17:56:23] eyener: Hi! I'm reading your comment on the presto error ticket [18:03:03] razzi: do you want to reboot an-launcher1002? [18:03:36] elukey: yeah, bc?
[18:03:58] elukey: or maybe it's not that involved and we can do so async [18:04:09] razzi: I think that we can do it in here if you are ok [18:05:57] razzi: to recap - first thing is to check what's running with 'systemctl list-timers' [18:06:33] we have to identify the prefixes to stop [18:06:44] ah also, let's disable puppet [18:06:58] with something like "Razzi - prepping for reboot" [18:07:06] elukey: sounds good [18:07:13] one first example could be [18:07:27] sudo systemctl stop 'reportupdater-*.timer' [18:07:41] the important bit here is the .timer at the end [18:07:59] since if you do stop reportupdater-* you'll target the service, that might be running [18:08:06] we want to stop scheduled executions [18:08:16] (and basically gently draining) [18:08:41] eventually you'll end up with systemctl list-timers showing only system level timers [18:08:44] like logrotate etc.. [18:08:47] that are fine to run [18:09:04] once done, we'll need to check if any java/python processes are running [18:09:20] if yes, let's wait until they finish, otherwise green light to reboot [18:09:33] then puppet enable + run and the maintenance is done :) [18:09:35] I don't see reportupdater- timers in systemctl list-timers [18:09:49] Wed 2021-01-27 19:00:00 UTC 53min left Wed 2021-01-27 18:00:00 UTC 6min ago reportupdater-browser.timer [18:10:03] on what host are you? [18:10:09] :) an-master oops [18:10:18] ah yes it makes sense then :D [18:10:35] Reenabled puppet, now on to an-launcher1002 [18:10:51] 10Analytics-Radar: Presto error in Superest - only when grouping - https://phabricator.wikimedia.org/T270503 (10JAllemandou) Hi @EYener > presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event_sanitized/CentralNoticeBannerHistory/year=2021/month=1/day=9/hour=21 I have not experienced th... [18:12:42] elukey: are we still ok in terms of network?
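The reboot-prep walkthrough above can be summarized as an ordered checklist. The sketch below only *builds* the shell commands as strings (a dry run); it executes nothing. The timer globs are examples taken from the conversation, and the puppet enable/disable/run invocations are standard `puppet agent` CLI calls assumed here, not commands quoted from the chat.

```python
# Dry-run sketch of the an-launcher drain procedure described above.
# Nothing is executed; the function just returns the commands in order.

def drain_commands(timer_globs, reason="Razzi - prepping for reboot"):
    """Commands for gently draining a host before reboot."""
    cmds = [f'sudo puppet agent --disable "{reason}"']
    for glob in timer_globs:
        # The trailing '.timer' matters: stopping 'foo-*' alone would also
        # hit foo-* *services* that may be mid-run. We only stop the
        # scheduling, then let in-flight jobs finish on their own.
        cmds.append(f"sudo systemctl stop '{glob}.timer'")
    cmds += [
        "systemctl list-timers",    # only system timers (logrotate, ...) should remain
        "pgrep -af 'java|python'",  # wait until no analytics jobs are running
        "sudo reboot",
        "sudo puppet agent --enable",
        "sudo puppet agent -t",     # the puppet run restores all puppet-defined timers
    ]
    return cmds

for c in drain_commands(["reportupdater-*", "hdfs-cleaner-*",
                         "mediawiki-*", "hdfs-balancer"]):
    print(c)
```

The design point from the chat: stop timers (the schedules), not services (the running jobs), and only reboot once `list-timers` and the process check are both clean.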
[18:14:36] joal: it seems so yes, no complaints about link saturation [18:14:46] ack elukey - thanks for checking [18:15:02] elukey: something else if you may? [18:15:58] joal: sure what's up [18:16:44] elukey: we're gonna need all users setup on the backup cluster :( [18:17:16] elukey: the /user folder is looking wrong despite me having resynced [18:17:24] in terms of ownership [18:17:36] joal: you wiped and re-copied right? [18:17:52] elukey: I distcp -update - which should do the same [18:18:53] joal Awesome! You fixed it! :) I am not sure what the issue was but every chart in that dash was failing to load yesterday [18:19:20] eyener: eh :) Fixing without touching is my preferred way - usually doesn't work though :) [18:20:02] thanks for letting me know eyener - sorry for no good answer on updating charts (yet) [18:21:20] Ha no worries joal - appreciate you checking it out. I've asked in the Superset slack workspace as well and haven't received a reply but I'll let you know if I ever figure it out [18:21:29] maybe some jinja templating or something...? [18:22:11] very possible eyener - /me is no superset ninja for sure [18:22:13] joal: not sure, have you tried to explicitly wipe and copy a single user dir? Just to see if perms are weird [18:22:28] in theory users are already deployed on the cluster, on all nodes [18:22:31] masters + workers [18:22:35] MQH [18:22:37] MEH [18:22:50] 10Analytics, 10Product-Infrastructure-Data, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (10Ottomata) In a meeting with devs doing client error logging today, we realized that conf...
[18:23:06] elukey: I'll try wipe-out for real and see if it changes anything [18:23:29] elukey: and I'll use 64 mappers as my basis [18:24:01] perfect thanks [18:24:11] if it doesn't work we can check again but it is weird [18:24:16] sure elukey [18:24:33] thanks for confirming that the hardware should be ready [18:25:15] joal, I believe the changes you did to hdfs cleaner need to be deployed? [18:25:40] mforns: I think elukey did? [18:25:47] maybe not? [18:26:01] joal: isn't the hdfs cleaner in refinery repo? [18:26:20] mforns: yep the three timers have been deployed [18:26:34] ok elukey thanks [18:26:38] mforns: I have not changed the code - only added puppet stuff :) [18:26:46] 10Analytics-Clusters, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) Hi @elukey - thanks for the mapping. What makes it tough is that the remaining 6x hosts need to be on 10g switches, which really limits our op... [18:26:46] ok ok :] [18:26:54] thanks for checking mforns [18:31:52] mforns: in theory I should be on-call now right? Anything to handover? [18:31:59] forgot to ask during standup [18:32:06] razzi: how are things going? [18:32:15] elukey: no no, it's tomorrow [18:32:33] elukey: good, have stopped some more timers, still going through the list [18:32:45] okok [18:36:47] I believe the following services should be kept, am I missing any? [18:36:47] export_smart_data_dump.service [18:36:47] logrotate.service [18:36:47] man-db.service [18:36:47] systemd-tmpfiles-clean.service [18:38:19] oh and apt-daily.service and apt-daily-upgrade.service [18:38:30] yes yes [18:38:45] the only one that you missed is the hdfs-cleaner-* [18:38:59] those are the periodical jobs that clean up some dirs in hdfs [18:40:29] razzi: --^ [18:41:36] cool [18:45:34] razzi: can you stop them? 
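The distcp tuning discussed above (halving the mapper count, and the earlier question about throttling) can be sketched as command construction. `-update`, `-m`, and `-bandwidth` are standard Hadoop DistCp options; the source/destination paths and the numbers here are illustrative, and the backup-cluster URI is a hypothetical placeholder, not the real one from the chat.

```python
# Sketch: build a hadoop distcp invocation with throttling knobs.
# -m caps the number of concurrent map tasks; -bandwidth caps MB/s
# *per map*, so aggregate throughput ~= mappers * bandwidth_mb.

def distcp_command(src, dst, mappers=64, bandwidth_mb=None, update=True):
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")  # skip files already copied, so reruns resume
    cmd += ["-m", str(mappers)]
    if bandwidth_mb is not None:
        cmd += ["-bandwidth", str(bandwidth_mb)]
    cmd += [src, dst]
    return cmd

print(" ".join(distcp_command(
    "hdfs://analytics-hadoop/wmf/data/event",
    "hdfs://backup-cluster/wmf/data/event",  # placeholder destination
    mappers=64, bandwidth_mb=50)))
```

Lowering `-m` (as joal did, 64 as the new basis) reduces parallelism; `-bandwidth` would additionally bound each map's network usage, which is one way to avoid the link-saturation alarms mentioned earlier.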
[18:45:45] so we can proceed with the next steps :) [18:46:08] yes yes, got distracted [18:47:51] razzi: also there are mediawiki-* and hdfs-balancer [18:48:30] we should really think about changing the names, adding something like analytics- in front [18:48:34] How about prometheus-nic-firmware-textfile / prometheus_intel_microcode? [18:49:17] those are fine, the prometheus exporters can be left aside [18:49:23] they just expose metrics [18:49:30] ok cool [18:50:00] then we need to make sure that no java/python processes are running, and if so we'd need to wait [18:50:33] so wait, should hdfs-balancer and mediawiki* be stopped? [18:54:05] yep yep [18:54:16] those don't need to run while we reboot [18:55:50] ok should be all set to reboot [18:57:10] razzi: what about java/python processes running? [18:57:16] oh right [18:57:38] also you didn't stop the hdfs-cleaner timers [18:58:21] ottomata: not sure if I need a +1 for these, but just in case, can you look? :] [18:58:23] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/659022 [18:58:28] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/658426 [19:00:15] elukey: alright, stopped [19:01:59] We would like to run a spark job that downloads all commons images from swift and stores the base64 image bytes in a column on hdfs, there will be roughly 7TB of data. Is there a recommended folder to store such a dataset, ie so that the size will not cause problems and it is available for others on the team? [19:04:28] fkaelin: I suspect that you are working with Miriam :D [19:05:34] fkaelin: so there are a couple of things to check - how many files are we talking about? (the hdfs namenode suffers a bit when we add a million more files etc..) [19:05:45] I am more concerned about that than those 7TB of space [19:06:04] razzi: so next steps?
:) [19:06:31] I see a couple of python processes: eventlogging_to_druid_navigationtiming_hourly and eventlogging_to_druid_navigationtiming_daily [19:06:31] and a couple java ones org.wikimedia.analytics.refinery.job.HiveToDruid [19:06:31] No idea how long they'll take to finish [19:06:47] razzi: perfect [19:06:54] one thing to check is when they started [19:07:02] one on Jan25 [19:07:13] the other on Jan21 [19:07:30] or not sorry lemme check better [19:07:34] I might say something silly [19:08:20] mmm yes weird they have been running for a while [19:08:50] mforns: holaaaaa [19:08:55] do you have a min? [19:09:36] the navtiming hourly + daily hive2druid indexations seem to be taking a lot of time, they started hours and hours ago [19:09:45] has it ever happened that they got stuck? [19:10:17] elukey yes, that is work with miriam. the image bytes will be stored as base64 encoded strings in a schema, so the number of files depends on whatever blocksize hadoop/spark chooses [19:11:39] fkaelin: okok so 7TB is a bit but we have a lot of space, and it is a one off, the only thing that we should check is how many files will be generated.. if it is say 10 millions it might be a problem, if we are talking about a few thousand I think it is fine [19:12:20] fkaelin: can we run a test on a subset of data to see how many files are generated? [19:13:26] our blocksize for hadoop is 256M IIRC [19:14:06] razzi: since we cannot leave things stopped for so long, let's reboot an-launcher1002 [19:14:15] those two jobs seem stuck [19:14:25] (we need to downtime first) [19:19:50] razzi: I am rebooting the node myself, we should not wait this long [19:20:13] we stopped camus for a long time and when it restarts it lags for a while [19:20:26] so when doing maintenance let's focus on the task please :) [19:21:37] elukey: alright yeah [19:24:49] elukey: in meeting! it finishes in 25mins [19:25:03] mforns: all good!
We can follow up tomorrow [19:25:27] elukey: but yes, it happened start of the year! [19:25:42] sigh :( [19:25:55] razzi: ok host is up, can you re-enable and run puppet? [19:27:16] elukey: re-enable timers via systemctl start? [19:27:35] razzi: a puppet run is sufficient to restore all puppet-defined timers [19:27:50] gotcha, that makes sense [19:27:56] sorry mforns looks like you got em +ed :) [19:28:12] ottomata: yes, no problemo, they deployed :] [19:29:04] razzi: hm i did migrate a bunch of navigationtiming data to event platform today! [19:29:11] i wouldn't expect it to cause issues [19:29:13] but..would it? [19:29:32] mforns: can you think of anything in HiveToDruid that would need to be changed to deal with events with migrated schema? [19:29:37] the hive table was migrated yesterday [19:29:39] one job was stuck since the 21st :( [19:29:46] the other from the 25th [19:29:55] ottomata: in a meeting, but will respond in a bit! [19:30:13] hm yeah i didn't touch navigation timing until yesterday [19:30:15] also meeting! :) [19:30:24] all right I am going to dinner, ttl! [19:30:29] cya elukey [19:31:07] l8rs [19:31:23] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics, 10Product-Infrastructure-Data: Roll-up raw sessionTick data into distribution - https://phabricator.wikimedia.org/T271455 (10sdkim) a:05mforns→03Mayakp.wiki [19:37:22] * razzi afk for lunch [20:02:37] gone for tonight team - see you tomorrow [20:09:03] elukey for the tests I used the default blocksize which seems to be 64MB. So for 7TB of data we are looking at ~100k files, or ~25k if we set the blocksize to 256MB. [20:12:26] elukey the job will run over a couple days on a small number of workers (aiming for ~100qps to swift), so the hdfs files will be created at a slow pace.
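The file-count estimates quoted above check out if we assume roughly one output file per HDFS block (a simplification: the real count depends on how Spark partitions the output, not on `dfs.blocksize` alone):

```python
# Back-of-the-envelope check of the "~100k files at 64MB blocks,
# ~25k at 256MB" figures from the conversation, for 7TB of data.
TB = 1024 ** 4
MB = 1024 ** 2

def approx_files(total_bytes, block_bytes):
    # ceiling division: partially filled last block still counts
    return -(-total_bytes // block_bytes)

print(approx_files(7 * TB, 64 * MB))   # 114688 -> the "~100k files" figure
print(approx_files(7 * TB, 256 * MB))  # 28672  -> the "~25k" figure
```

At the cluster's 256MB `dfs.blocksize` (confirmed later via yarn.wikimedia.org/conf), the dataset lands well under the "10 million files" namenode-pressure threshold elukey was worried about.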
[20:27:16] 10Analytics: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (10EBernhardson) Concur with regard to multi-tenancy, I tried to setup our airflow initially in a way that used the builtin multi-tenancy but as soon as I started inte... [20:50:34] ottomata: I don't see anything that would need to be changed for HiveToDruid re. migrated schemas... [20:50:58] ottomata: maybe the only thing would be the meaning of the dt field [20:51:22] but IIUC the meaning of dt does not change right? [20:52:28] and all other fields are available with the same name in a backwards compatible way... so, I'd say no changes needed [20:54:26] joal if you're around, I'm getting another iteration of the `presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event_sanitized/CentralNoticeBannerHistory/year=2021/month=1/day=9/hour=1` error when I try to edit the Banner History dash [20:57:45] (03PS1) 10Mforns: Add en.wikidata to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/659081 [20:59:58] mforns: no it's the same with legacy data [21:00:07] dt only means event time for new schemas [21:02:49] I see ottomata, HiveToDruid will work for new schemas the same, the only difference (if we want to use a time field other than dt for a given dataset) would be we have to explicitly specify it from druid_load.pp (which is already supported) [21:03:03] cool!
[21:08:05] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 2 others: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 (10mforns) [21:08:49] 10Analytics, 10Analytics-Kanban, 10Event-Platform: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 (10mforns) [21:09:00] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Patch-For-Review: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 (10mforns) [21:15:58] 10Analytics, 10Better Use Of Data: Create Oozie job for session length - https://phabricator.wikimedia.org/T273116 (10mforns) [21:33:53] fkaelin: default block size should be 256MB [21:33:55] https://yarn.wikimedia.org/conf [21:33:59] dfs.blocksize [21:38:43] mforns: tomorrow my morning i'm going to migrate my nav timing schemas to all wikis, if you are around we can do yours at the same time (without a deployment window)