[00:09:29] PROBLEM - Check the last execution of hadoop-namenode-backup-fetchimage on an-worker1124 is CRITICAL: CRITICAL: Status of the systemd unit hadoop-namenode-backup-fetchimage https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:57:32] 10Analytics, 10GrowthExperiments, 10Growth-Team (Current Sprint), 10MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), 10Patch-For-Review: eventgate_validation_error for NewcomerTask, HomepageTask, and HomepageVisit schemas - https://phabricator.wikimedia.org/T273700 (10Tgr) >>! In T273700#6801395, @Ottomata wro... [06:57:36] good morning! [07:43:15] so I have checked a bit the number of threads that we use for the namenode to handle RPCs [07:43:18] and it is 115 [07:43:48] in https://issues.apache.org/jira/browse/HDFS-12545 there are some heuristics about how to calculate it, and we are more or less in line [07:45:54] we may want to bump it a little after adding the 24 nodes [07:45:58] but the max is 200 [07:46:19] I really think that we should split the RPC queue from block reports and zk-failover control [07:49:34] makes sense elukey [07:49:37] good morning [07:53:05] elukey: I assume the alert from the backup cluster comes from not enough space being available :( [07:59:14] joal: yes yes my bad, I already sent a code change, we don't have dedicated space for a lot of fsimage backups [07:59:24] nothing related to the data that you are copying :) [07:59:33] Ah ok - no problem, I was wondering :) [08:00:09] elukey: with your approval I'll launch new copies with more resources, trying to speed up the process and get a feel for how fast we can be - I'll stop jobs if the namenode complains [08:01:38] joal: gimme 5 mins and you'll also have +48TB :) [08:01:44] \o/ [08:02:19] elukey: I wonder if it would be a good idea to launch a manual hdfs-balancer with high throughput when the new node is there [08:02:35] probably yes, it will help [08:02:48] ok, let's do that [08:03:18] joal: just check that another one is not running [08:03:41] of course - I'll ask you to stop the regular timer, and then we'll manually launch a stronger one [08:03:57] yep it is running [08:04:00] actually elukey, do we need a puppet patch? [08:06:40] joal: what I did is to stop puppet on an-worker1118, stop hdfs-balancer and hdfs-balancer.timer [08:06:48] perfect :) [08:07:34] joal: I can manually modify the timer and re-run it, should be fine, then puppet will auto-correct it [08:07:47] works for me [08:07:49] atm we use [08:07:50] hdfs dfsadmin -setBalancerBandwidth $((40*1048576)) [08:08:01] /usr/bin/hdfs balancer -threshold 5 [08:08:28] looking at params [08:08:31] https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=25&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-backup-hadoop&var-worker=All [08:08:37] we have moar space [08:08:44] \o/ MOAR! [08:08:54] !log move an-worker1117 from Hadoop Analytics to Hadoop Backup [08:08:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:09:56] elukey: We should make the bandwidth greater - There is no other use on the cluster, so I'd say 1G [08:10:01] Do you think it's ok?
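[Editor's note: a back-of-the-envelope check of the handler-count sizing discussed at 07:43. This is a hedged sketch: the 20 × ln(datanodes) rule of thumb below is the one commonly cited in Hadoop tuning guides and is an assumption here, not quoted from HDFS-12545, and the datanode count is illustrative.]

```bash
# Estimate dfs.namenode.handler.count from the number of datanodes,
# assuming the commonly cited 20 * ln(datanodes) rule of thumb.
datanodes=315   # illustrative, not the actual cluster size
awk -v n="$datanodes" 'BEGIN { printf "suggested handler count: %d\n", 20 * log(n) }'
# -> suggested handler count: 115
```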
[08:14:23] elukey: I have found this example command: hdfs balancer -Ddfs.balancer.movedWinWidth=5400000 -Ddfs.balancer.moverThreads=1000 -Ddfs.balancer.dispatcherThreads=200 -Ddfs.datanode.balance.max.concurrent.moves=5 -Ddfs.balance.bandwidthPerSec=100000000 -Ddfs.balancer.max-size-to-move=10737418240 -threshold 5 [08:16:45] joal: we can try [08:17:01] elukey: give me a minute, I'm reading an interesting post: https://www.linkedin.com/pulse/increasing-hdfs-balancer-performance-doddy-sebastianus/ [08:17:21] sure [08:18:37] ah forgot to say, prestodb merged my patch, but it will not go into 0.247 [08:18:50] in theory they will prep another release in a couple of weeks [08:18:52] ok elukey - so, I'd go for: hdfs dfsadmin -setBalancerBandwidth $((1024*1048576)) [08:19:02] This is great elukey (presto) [08:19:14] then: hdfs balancer -Ddfs.balancer.movedWinWidth=5400000 -Ddfs.balancer.moverThreads=1000 -Ddfs.balancer.dispatcherThreads=200 [08:19:18] oops [08:20:41] without -threshold 5 right? [08:20:56] or with it? [08:21:10] no no - I messed up my paste [08:21:35] here it is: sudo -u hdfs HADOOP_HEAPSIZE=4096 kerberos-run-command hdfs hdfs balancer -Ddfs.balancer.moverThreads=1000 -Ddfs.balancer.movedWinWidth=20737418240 -Ddfs.balancer.max-size-to-move=10737418240 -Ddfs.balancer.bandwidthPerSec=10737418240 -Ddfs.datanode.balance.max.concurrent.moves=50 -Ddfs.balancer.dispatcherThreads=200 -Ddfs.datanode.balance.bandwidthPerSec=10737418240 -threshold 5 [08:23:21] elukey: --^ [08:24:03] started [08:24:10] \o/ [08:24:18] if we break the network I'll blame you [08:24:19] :P [08:24:31] thanks elukey - I could have started the thing [08:25:00] elukey: checking - have you run the bandwidth change first? [08:25:30] yep yep [08:25:40] great [08:25:42] thanks :) [08:27:19] elukey: do we have a way to monitor the duration of the balancer run [08:27:22] ? [08:27:49] not sure, I think it is just tailing from the logs, we don't have metrics [08:27:54] ack [08:29:04] for presto, I think that we could build the jars with my patch on top of 0.246, get the jar that contains the classes that changed (already spotted which one it is) and release something like 0.246~wmf1 [08:29:42] then we skip 0.247 and from 0.248+ we should be good [08:29:53] (in this way we can upgrade presto during these days) [08:30:03] elukey: I assume this involves creating a new wmf-repo, or do we already have one? [08:30:33] also elukey, from yesterday's meeting with product-analytics: we should prepare for growth in presto usage [08:31:52] joal: nono we can release our own version as a one-off, no need for a new repo [08:32:01] ok :) [08:32:05] Let's do it [08:32:26] joal: we'll fight the horde of people using presto with Alluxio :D [08:32:37] elukey: While reviewing the in-flight tickets, I was wondering if we expect bumping versions to solve our perms issues [08:33:07] joal: --verbose :D [08:33:10] elukey: for sure we will!
I'm just thinking that possibly having one or two more beasts could help :) [08:33:22] hum, yes, sorry elukey :) [08:33:43] so: we're having permission issues with presto on data, even for users who are in analytics-privatedata-users [08:34:12] For instance, Erin not being able to run queries against data she used to be able to query [08:34:35] this one: https://phabricator.wikimedia.org/T270503 [08:34:56] When I open the dashboard, it works, but when she does, it doesn't [08:35:07] ah snap that one yes [08:35:09] And, we have also seen an interesting bug in Superset [08:35:48] discussed yesterday with PA: Thanks to Superset caching, people outside analytics-privatedata-users can see data they shouldn't have access to [08:35:52] I wanted to bump presto to get more logging etc.. since it wasn't clear to me where the error message came from [08:35:57] the current presto logs are a mess [08:36:01] Right I remember that [08:36:08] Ok let's bump presto first [08:36:39] ah so superset caching is not working as expected, lovely [08:36:49] but dashboards or even sqllab? [08:36:59] well, it works great, but doesn't enforce permissions [08:37:07] I have not tested sqllab [08:38:04] ok so I'd say that we can open a task for it and let razzi investigate [08:38:12] since he worked on the original caching [08:38:15] works for me - Creating a task [08:39:11] I was doing some delicate work and I wasn't able to join yesterday's meeting [08:39:31] no problemo elukey - That's why I looped you in :) [08:40:01] I mentioned the upgrade next tuesday and Fran coined the hashtag: #prayforanalytics [08:40:11] I wanted to ask people to use the 2FA token, will follow up on slack [08:40:25] Superset 1.0 should have more fine-grained access [08:40:53] elukey: I actually think the problem comes from us not setting superset groups correctly [08:42:00] joal: ? [08:42:15] we can definitely add more, but it is a burden for us [08:42:25] elukey: indeed! [08:42:36] and the Presto dashboard problem is not easy to solve [08:42:46] elukey: caching at superset level is bound to not be permission-aware [08:42:52] my point is that we have to work with them to gradually reach a good state [08:43:12] then we have to remove caching [08:43:18] and/or follow up with upstream [08:43:23] right [08:43:25] https://preset.io/blog/2021-01-18-superset-1-0/ contains a lot of new things [08:43:39] elukey: will read! [08:43:42] upgrading to 1.0 is probably the first step [08:44:22] but I am not gonna do it, I would like Razzi to pick up this work [08:44:33] elukey: also discussed yesterday: better user-management for data access (access, fine-grained security etc) will be a broader project that will need to be planned etc [08:44:37] working also with PA to improve access [08:45:11] joal: did anybody say that we have also done good things recently? Like the analytics-privatedata-users no-ssh access :D [08:45:20] Yes we have :) [08:49:23] elukey: I'm reading some alluxio - this thing is awesome :) [08:50:49] joal: I am writing a task about the caching problem if you haven't done so, so I can ping Razzi [08:51:10] I'll add it as a subtask of the superset upgrades one [08:51:34] elukey: I had that in my TODO - thanks for doing it [08:51:39] ack [08:57:06] elukey: is there a place where I can check our hardware setup? (for instance presto hosts, and/or druid hosts, and hadoop workers)? [08:57:44] joal: do you mean cpus/ram/etc.. ?
[08:57:56] correct, and also disks (space+type) [08:58:22] host-overview in grafana has most of the info, for disk type I can check [08:59:30] ack [08:59:41] I was wondering about disk space and type of the presto workers [09:00:19] joal: those are basically hadoop workers [09:00:24] Right [09:00:26] ok [09:00:27] thanks [09:01:29] !log restart superset and disable memcached caching [09:01:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:03:23] I created https://phabricator.wikimedia.org/T273850 for the caching issue [09:03:56] ack elukey - restricted view [09:04:54] joal: you should be able to see it, aren't you in wmf-nda? [09:05:17] maybe I'm not? [09:05:28] ah snap [09:05:32] let's correct this [09:11:17] created https://phabricator.wikimedia.org/T273852 [09:11:39] Thanks elukey :) [09:32:20] 10Analytics, 10Wikidata, 10Wikidata-Query-Service: Automate regular WDQS query parsing and data-extraction - https://phabricator.wikimedia.org/T273854 (10JAllemandou) [09:32:35] 10Analytics, 10Wikidata, 10Wikidata-Query-Service: Automate regular WDQS query parsing and data-extraction - https://phabricator.wikimedia.org/T273854 (10JAllemandou) a:03JAllemandou [09:38:08] 10Analytics: Build a process to check permissions when changing datasets from non-PII to PII - https://phabricator.wikimedia.org/T273818 (10JAllemandou) Obviously having Data governance will help for that one. I however would rather go for a different approach here, namely make sure the jobs rely on using `/wmf/... [09:49:20] data moves, but slowly :( [10:29:40] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (10Zache) Very low tech workaround for decreasing the current load could be generating... [11:17:05] 10Analytics, 10Product-Analytics, 10Epic: Revamp analytics.wikimedia.org data portal & landing page - https://phabricator.wikimedia.org/T253393 (10mforns) a:05mforns→03None [11:22:31] elukey: Do you have time for us to bump the AQS druid snapshot now? (patch on its way) [11:23:22] joal: sure [11:28:36] !log move aqs druid snapshot config to 2021-01 [11:28:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:28:46] 10Analytics, 10Event-Platform, 10Product-Data-Infrastructure: PrefUpdate Event Platform Migration - https://phabricator.wikimedia.org/T267348 (10mforns) Starting this migration now! [11:30:53] done! [11:31:17] (03PS1) 10Mforns: Add PrefUpdate schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661706 (https://phabricator.wikimedia.org/T267348) [11:32:13] (03CR) 10Mforns: [V: 03+2 C: 03+2] Add PrefUpdate schema to analytics/legacy [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/661706 (https://phabricator.wikimedia.org/T267348) (owner: 10Mforns) [11:37:54] mforns: Sorry to hound you about the reportupdater-queries patches, it's not terribly urgent that these are handled right away, but it would be helpful if you could estimate a timeline for review.
[11:39:58] RECOVERY - Check the last execution of hadoop-namenode-backup-fetchimage on an-worker1124 is OK: OK: Status of the systemd unit hadoop-namenode-backup-fetchimage https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:41:47] 10Analytics, 10observability: Modify Kafka max replica lag alert to only alert if increasing - https://phabricator.wikimedia.org/T273702 (10fgiunchedi) >>! In T273702#6800231, @Ottomata wrote: > So even though alertmanager is upcoming, should we continue to use `check_prometheus`? (Can't use `grafana_alert`,... [11:45:36] awight, will look at them later today, sorry for the delay [11:50:18] * elukey lunch! [11:50:21] hola mforns [11:54:05] heya elukey [12:42:24] * mforns going for errands, back later [13:01:16] mforns_brb: Not a problem, I just wondered about times for planning reasons. [13:01:21] Thank you for all the help! [13:17:22] 10Analytics: HDFS Namenode: use a separate port for Block Reports and Zookeeper failover - https://phabricator.wikimedia.org/T273629 (10elukey) There is a way to enable this without having the Hadoop cluster down: https://community.cloudera.com/t5/Community-Articles/How-do-you-enable-Namenode-service-RPC-port-wi... [13:25:58] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) Tested with `ssh -L 4992:eventstreams-internal.discovery.wmnet:49... [13:31:19] joal: you are in WMF-NDA now :) [13:31:29] I also added fdans and razzi [13:31:37] the rest of the team should be in it as well [13:31:57] elukey: thank you luca :) [13:33:14] <3 [13:43:05] Thanks a lot elukey :) [13:45:04] 10Analytics, 10observability, 10User-fgiunchedi: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (10fgiunchedi) [13:46:51] 10Analytics: HDFS Namenode: use a separate port for Block Reports and Zookeeper failover - https://phabricator.wikimedia.org/T273629 (10JAllemandou) > it seems required to formatZK for both namenodes This is scary! the plan of doing it after upgrade is good to me :) [13:48:19] 10Analytics: HDFS Namenode: use a separate port for Block Reports and Zookeeper failover - https://phabricator.wikimedia.org/T273629 (10elukey) @JAllemandou in theory, IIUC, it should be possible to stop the hadoop-hdfs-zkfc daemons without risking downtime for the namenodes, and reset their state in zookeeper.... [13:48:57] 10Analytics, 10observability, 10User-fgiunchedi: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (10fgiunchedi) @razzi @elukey @Ottomata I'd like to go over the VO setup with you and see what makes the most sense, is next week ok for a 45 min chat ? [15:01:39] Gone for kids - Back for staff meeting [15:02:07] by the way team, I assume we cancel standup and possibly take the beginning of grosking to sync? [15:02:30] lexnasser: I forgot to mention - today is staff meeting - I need to cancel our 1-1 [15:02:42] lexnasser: I'll try to reschedule :) [15:33:40] * elukey hates presto [15:44:16] too many r's? 
[15:47:32] I think that it is a combination of horrible docs plus confusing defaults [15:47:44] I am trying to just enable an access.log [15:47:50] that should be created by default [15:47:54] but it doesn't happen [15:48:25] I am pretty sure that I'll end up discovering the problem and throwing my laptop out of the window [15:59:18] joal: no worries, thanks for the heads up :) [15:59:36] ottomata: hey! I went today to the euro-mid-day backport window to migrate PrefUpdate on testwiki, but could not make events land in kafka, the browser was sending them to the correct URL and with the correct payload, but they would not appear in the kafka topic... thus, we reverted the change... [16:00:36] a-team delaying standup half an hour for people who are attending the monthly staff meeting [16:00:49] cool [16:05:19] mforns: Hi hm [16:05:25] hey :] [16:05:41] Hmmmm mforns were they invalid? [16:05:47] ottomata: can it have something to do with the following problem?: [16:06:17] when I merge the changes in secondary repo, it takes a long time until I can evolve the hive tables. [16:06:33] mforns: it should take however long it takes puppet to run, and i can run it manualy for ya [16:06:38] before that, I keep getting errors that the schema is not available unther the URL in secondary repo [16:06:54] today it took about 25 minutes [16:07:08] but even then, it fails eventually with the same error [16:07:13] that sonds about right mforns , puppet runs about every 30 minutes [16:07:17] ok [16:07:26] a-team pls don't miss the staff meeting [16:07:34] in it :] [16:07:48] staff meeting... [16:07:59] at noon? [16:08:05] in 52 mins? [16:08:27] mforns: i see the schema now [16:08:35] you are sayin you can't evolve the hive table? [16:11:06] ottomata: I could, after waiting for puppet, but still after that, some runs of the script (with dry_run) would fail with the same error (on and off error) [16:14:12] I thought the script would grab the schema from the registry, not pull the repo locally, and thought it could be some kind of caching problem, but probably off [16:15:25] 10Analytics, 10Event-Platform, 10Wikidata, 10Wikidata-Query-Service: Automate event stream ingestion into HDFS for streams that don't use EventGate - https://phabricator.wikimedia.org/T273901 (10Ottomata) [16:16:01] mforns: there are 4 different nodes, 2 in each DC that serve schemas, and i guess puppet runs at different times on all of them [16:16:19] maybe we should change the deployment model for schema.wm.org :) [16:16:28] to something other than puppet + git pull [16:16:39] mforns: does it work now? [16:16:51] ottomata: let me check [16:18:37] ottomata: seems to work every time now. Maybe yes, also, does schema.wm.org go though varnish, does it have cache? [16:19:21] Maybe the reason it didn't work is I pushed the schema change like 20 minutes before the backport window [16:21:36] ottomata: o/ did you see the new shiny eventstreams-internal?? [16:21:54] NOT YET [16:21:58] with discovery? [16:22:01] yep! [16:22:04] all done [16:22:07] WOWOOWOW [16:22:16] I tested it earlier on and it works [16:22:18] \o/ [16:22:47] hmm how did you tunnel? 
[16:22:54] i'm trying [16:22:54] ssh -t -t -N -L4992:eventstreams-internal.discovery.wmnet:4992 bast1002.wikimedia.org [16:23:30] I used mwmaint1002, but without the -t -t [16:24:00] ssh -L 4992:eventstreams-internal.discovery.wmnet:4992 -N mwmaint1002.eqiad.wmnet [16:24:08] a-team since there's a tech-specific follow-up to the staff meeting, I'm cancelling today's standup and grosking in the end [16:24:16] https://localhost:4992/ [16:24:17] ? [16:24:18] right? [16:24:24] yes exactly [16:24:37] hmmm not working for me elukey [16:24:50] I see the UI now [16:24:54] what do you get? [16:25:19] ssh -v -L 4992:eventstreams-internal.discovery.wmnet:4992 -N mwmaint1002.eqiad.wmnet [16:25:31] This site can’t be reached - localhost refused to connect. [16:25:37] i must be doing something stupid [16:25:55] OHHH you know [16:26:03] i think i've been having ssh problems recently, i think the tunnel is not open yet [16:26:16] i've noticed it's taken my ssh connections about 30-45 seconds to connect for the last few days [16:26:22] wow [16:27:15] works for me [16:27:53] now it works for me too [16:27:59] yeah, ssh is weird [16:28:02] thank you luca very COooOOl! [16:29:06] \o/ [16:29:21] all right we can close the task then [16:33:19] 10Analytics, 10GrowthExperiments, 10Growth-Team (Current Sprint), 10MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), 10Patch-For-Review: eventgate_validation_error for NewcomerTask, HomepageTask, and HomepageVisit schemas - https://phabricator.wikimedia.org/T273700 (10Ottomata) > I wonder if it would be hard t... [16:33:36] awesome [16:33:46] elukey: i will write docs and send an announcement [16:34:06] mforns: schema.wm.org does go through varnish [16:34:33] actually, not if the uri used is discovery [16:34:34] let's see [16:34:55] mforns: that would make sense though, eventgate-analytics-external uses the same http schema repo as the hive evolve stuff does [16:35:04] so if eventgate can't get the schema it won't allow events in [16:35:30] ottomata: so even if puppet runs on all servers, could it be that the URI for the schema has been requested before that, and cached as a 404? [16:35:44] hmm, for eventgate, no, it uses the discovery uri [16:35:44] https://schema.discovery.wmnet/repositories/primary/jsonschema/ [16:35:49] so does not use varnish [16:35:55] hm [16:36:07] and same for refine [16:36:40] ottomata: maybe we can do a deployment to testwiki and check if it works now? [16:36:42] and for evolve [16:36:47] mforns: ya lets do it [16:36:49] submit a patch and i'll deploy [16:36:53] k [16:38:53] 10Analytics, 10observability, 10User-fgiunchedi: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (10Ottomata) Ya! [16:40:32] fdans: i'm confused, is the staff meeting happening now? it is on my calendar starting in 20 mins [16:40:50] ottomata: is your timezone ok? [16:40:55] yeah? [16:41:17] huh i see it in the wmf staff calendar [16:41:20] i don't know how google cal works [16:41:23] weird [16:42:40] ottomata: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/661746 [16:47:01] mforns: deployed to testwiki, let's use the new eventstreams-internal to test the event!!! [16:47:08] ottomata: ok! [16:48:20] ottomata: hm, the stream is not there... [16:50:24] mforns: i see it [16:50:31] https://localhost:4992/v2/ui/#/?streams=eventlogging_PrefUpdate,eventgate-analytics-external.error.validation [16:51:49] ottomata: it landed, I saw it in kafka, however in the ui I cannot choose it! [16:52:26] mforns: maybe refresh?
[16:52:34] the change we just merged just declared the stream [16:52:54] i see the event i produced in the ui [16:52:58] oh ok [16:53:12] now, a filter by wiki would be helpful! [16:53:20] indeed, or a general filter of any kind [16:53:24] would be pretty cooooOoOL [16:53:36] i had something like that going once a long time ago but removed it because it was just a complicated API [16:53:45] but you could build it into the ui rather than in the api [16:53:57] maybe... graphql based? [16:54:08] I'll be on a call with my lawyer for the Standup and (part of) grooming. Don't go too crazy without me ;) [16:54:19] klausman: we are skipping for staff meeting stuff [16:54:26] Ah, rog [16:56:56] Presto was not logging the access log since /srv/presto/var/log was not writable [16:57:05] without mentioning it in the logs [16:57:10] * elukey runs screaming [16:58:21] ottomata: but graphql would be a "horizontal filter", and what we're looking for is more a "vertical filter", no? Able to select which events are shown, not which fields are shown, right? [16:59:34] hm, i'd expect both are possible, no? [16:59:49] mforns: the graphql idea is cool but maybe hard and overkill [17:02:11] !log roll restart presto on an-presto* to finally get http-request.log [17:02:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:02:56] ottomata: I think a string filter would already be very powerful [17:03:18] yeah [17:03:56] maybe string with regexp [17:04:20] or maybe jq? [17:04:21] :p [17:04:32] or some already-built json filter lib [17:04:33] * ottomata https://www.google.com/search?q=npm+jsonpath+filter&oq=npm+jsonpath+filter&aqs=chrome..69i57j69i64.1582j0j9&sourceid=chrome&ie=UTF-8 [17:04:37] !log restart presto coordinator on an-coord1001 to pick up logging settings (log to http-request.log) [17:04:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:04:56] ottomata: yes, jq sounds good, but it lacks comparison syntax, no? [17:05:13] yes, a json filter lib is probably the best [17:05:24] https://www.npmjs.com/package/filter-object-array [17:05:34] pretty sure jq has comparison syntax [17:05:39] but i'm sure there are things that exist here [17:06:11] oh [17:06:31] https://lodash.com/docs/4.17.15#filter [17:07:23] // The `_.matches` iteratee shorthand. [17:07:23] _.filter(users, { 'age': 36, 'active': true }); [17:07:23] // => objects for ['barney'] [17:07:36] yes, I like that format [17:08:00] but the simple string search should be supported too, no? [17:08:25] although i guess you are just matching one object at a time [17:08:30] since you don't have an array of the whole stream [17:08:30] so [17:08:30] https://lodash.com/docs/4.17.15#matches [17:08:45] mforns: sure why not, meaning just search the whole json string? [17:09:36] yes, otherwise, you have to write "{'meta': {'dt': 'something something'}}" instead of just "T00:10" [17:10:11] true [17:10:21] yeah with kafkacat i often just grep | ottotest [17:10:24] or | grep testwiki [17:10:25] or something [17:10:28] yes [17:10:30] | grep testwiki | jq . [17:11:29] mforns: should we deploy prefupdate to all wikis? [17:15:22] I'm having internet problems... [17:15:58] but, yes!
I can do it in the next window if you want [17:16:07] joal: https://github.com/prestodb/presto/blob/6ff3052e9df1a04613157746e7e779fd54c0c2a1/presto-docs/src/main/sphinx/release/release-0.233.rst#hive-changes - I was wrong, hive.force-local-scheduling is gone (in favor of another one) [17:16:18] so it was not presto complaining because we set a default value, my bad [17:20:17] ack elukey [17:20:27] 10Analytics, 10Patch-For-Review: Decide to move or not to PrestoSQL - https://phabricator.wikimedia.org/T266640 (10elukey) Some updates: - my patch was merged by upstream, but it will not go out in 0.247, but probably in 0.248 - the option `hive.force-local-scheduling` was [[ https://github.com/prestodb/presto... [17:20:29] must have been something else :S [17:20:30] mforns: if you submit a patch I can deploy it [17:20:45] ottomata: ok will do, one sec [17:22:29] joal: we just need to drop it if we don't want to have it enabled (we currently set it as false) so all good [17:25:18] are we grooming, fdans? [17:25:54] mforns: cancelled [17:27:24] a-team I'm in the cave if y'all wanna chat [17:28:19] 10Analytics, 10Patch-For-Review: Decide to move or not to PrestoSQL - https://phabricator.wikimedia.org/T266640 (10elukey) The only weird thing is this: ` presto:wmf> select * from webrequest where webrequest_source='test_text' and year=2021 and day=20 and month=1 and hour=1 limit 10; Query 20210204_172606_00... [17:35:50] ottomata: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/661758 [17:45:29] mforns: on mwdebug1001 [17:45:35] testing on enwiki there [17:45:35] ottomata: testing [17:45:50] ok gr8 [17:50:05] ottomata: I'm testing on cawiki, and they are not landing for me [17:50:18] ottomata: oh! just landed, seems to work [17:50:29] ya just tested on enwiki and got it [17:50:35] ok mforns proceeding [17:51:56] synced mforns [17:52:00] ok! [17:55:14] eyener: o/ hi! If you have a moment can you retry to access the dashboard? [17:55:31] elukey: sure thing! [17:57:44] mforns: is PrefUpdate a server-side schema? [17:58:11] ottomata: I thought not! But not sure! [17:58:21] i'm seeing serialization validation errors [17:58:27] :[ [17:58:30] elukey: have you changed stuff on presto? [17:58:30] maybe they were always there? [17:58:31] looking [17:58:35] ottomata: should I revert? [17:58:37] elukey: I'm curious :) [17:59:17] mforns: [17:59:17] https://logstash.wikimedia.org/app/discover#/view/AXMlVWkuMQ_08tQas2Xi?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-1h,mode:quick,to:now))&_a=(columns:!(message,errored_stream_name,errored_schema_uri,meta.domain,emitter_id),filters:!(),index:'logstash-*',interval:auto,query:(language:lucene,query:'type:eventgate_validation_error%20AND%20errored_stream_name:eventlogging_PrefUpdate'),sor [17:59:17] t:!(!('@timestamp',desc))) [17:59:21] prepare patch [17:59:22] probably [17:59:23] checking [17:59:27] ok [17:59:45] joal: nono, on superset the username was registered as EYener for some reason, so I am wondering if it needs to be all lowercase, so I changed it. It is a plausible explanation for why an access-denied error could occur [18:00:59] yeah lets roll back mforns [18:01:02] wow elukey - amazing catch [18:01:07] ottomata: ok, on it [18:01:39] joal: let's see if eyener confirms, I hope it was that (I spent hours in the presto log dungeons, we need better logging, sigh) [18:02:07] mforns: do you know you can hit the revert button in gerrit?
[18:02:17] ottomata: yes [18:02:21] k [18:02:25] for some reason I did that manually [18:02:30] :) [18:02:35] haha elukey and joal I'm in Superset and adding a filter now to see if I hit an error [18:02:37] ottomata: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/661766 [18:02:38] no error! [18:02:45] waiting for jenkins [18:02:46] \o/ [18:02:55] eyener: \o/ [18:03:19] Amazing! thank you! [18:03:27] * joal bows to elukey's patience :) [18:03:44] elukey: would you give me a minute in da cave? [18:05:30] ok rolled back [18:06:06] eyener: do you recall more or less when you first logged in to Superset? Even high level, was it a looong time ago or recently? I need to investigate why your username was in that form :( [18:06:10] joal: sure [18:06:23] oh mforns it is eventlogging PHP [18:06:28] it even says so in the audit sheet [18:06:32] i'm sorry i told you to do it [18:06:36] sorry... [18:06:44] no no, I should have looked [18:06:49] elukey: about 1.5 years ago I'd say [18:08:16] 10Analytics, 10Event-Platform, 10Product-Data-Infrastructure, 10Patch-For-Review: PrefUpdate Event Platform Migration - https://phabricator.wikimedia.org/T267348 (10Ottomata) Oh my, this schema is produced from the Eventlogging PHP client, which does not yet work. We deployed and rolled back today, but t... [18:17:59] eyener: ah that makes sense! Thanks! [18:19:36] elukey I'll be curious to hear once you do an investigation (no rush) on what happened! [18:19:46] especially if it's user error... [18:20:19] eyener: so when you first log in, if the user is not there superset auto-creates it [18:20:39] using the uid field of your user in ldap, that is the same one used for your shell etc.. (so the one that hadoop understands) [18:20:57] I think that your first access was before the current config, and maybe something other than uid got stored [18:21:22] so up to now there was really no permission checking (since we were not using presto, and druid is not authenticated yet) [18:21:25] mforns: i'm working on a backfill [18:21:33] it's only 116 events, but why not eh? [18:22:05] eyener: but now there are, and you got the error.. will try to summarize the issue in the task, and check for occurrences of the same problem for other users, but for newer ones we should be ok! [18:22:13] need to go now people, ttl! [18:22:22] thanks elukey! [18:28:45] 10Analytics, 10Event-Platform, 10Product-Data-Infrastructure, 10Patch-For-Review: PrefUpdate Event Platform Migration - https://phabricator.wikimedia.org/T267348 (10Ottomata) I was able to backfill these events into eventlogging_PrefUpdate from the eventgate-analytics-external.error.validation stream. [18:36:35] ottomata: can I help? [18:39:09] mforns: i did it! [18:39:14] wasn't so bad, [18:39:21] ok ok [18:39:46] ottomata: curious about how you did it? [18:40:02] the events weren't in kafka, right? [18:40:51] ah! error validation stream [18:40:53] ok [18:40:54] ya! [18:40:59] took them out of the raw event [18:41:09] used kafkacat -Q with a timestamp to find the offset of the time we deployed [18:41:17] consumed all events, grepped for PrefUpdate [18:41:25] extracted the .raw_event field [18:41:44] unescaped it, then fixed the validation errors (removed quotes on userId and isDefault values [18:41:45] ) [18:41:51] then POSTed to eventgate-analytics-external [18:52:19] (03CR) 10Mforns: [C: 03+2] "LGTM! Let me know if you want to merge this already, and I'll do. Or merge it yourselves if you have permits!"
[analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657634 (https://phabricator.wikimedia.org/T273475) (owner: 10Awight) [18:53:54] (03CR) 10Mforns: [C: 03+2] "LGTM! Let me know when this is ready to deploy, and I will merge. Or merge yourselves if you have permits!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/661108 (https://phabricator.wikimedia.org/T273454) (owner: 10Awight) [18:59:01] (03CR) 10Mforns: [C: 03+2] "Thank you for this change! You guys are finding better ways of using reportupdater :]" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657362 (https://phabricator.wikimedia.org/T271902) (owner: 10Svantje Lilienthal) [19:06:54] (03CR) 10Mforns: Collect metrics of all wikis (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/655886 (https://phabricator.wikimedia.org/T271894) (owner: 10WMDE-Fisch) [19:08:50] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/655949 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight) [19:16:30] dcausse: Hi! [19:16:36] dcausse: is it too late for you? [19:18:29] I assume so :) [19:22:38] !log rebalance kafka partitions for eventlogging_MobileWikiAppLinkPreview [19:22:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:24:55] joal: if it isn't too late for you, want to chat? [19:25:08] sure ottomata [19:25:11] batcave! [19:26:57] !log rebalance kafka partitions for codfw.mediawiki.job.wikibase-addUsagesForPage [19:27:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:27:13] !log rebalance kafka partitions for eqiad.mediawiki.job.wikibase-addUsagesForPage [19:27:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:33:30] 10Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (10razzi) [19:43:49] 10Analytics: How I get data visitors from Ukraine ru, en wiki? - https://phabricator.wikimedia.org/T273924 (10Reedy) [19:44:16] 10Analytics: How I get data visitors from Ukraine ru, en wiki? - https://phabricator.wikimedia.org/T273924 (10Reedy) It very much depends what data you're wanting... And what formats Some is available here: https://stats.wikimedia.org/#/ru.wikipedia.org https://stats.wikimedia.org/#/en.wikipedia.org [20:03:28] 10Analytics: How I get data visitors from Ukraine ru, en wiki? - https://phabricator.wikimedia.org/T273924 (10Aklapper) 05Open→03Invalid Hi @ua_user, thanks for taking the time to report this and welcome to Wikimedia Phabricator! This does not sound like something is wrong in the code base (a so-called "soft... [20:14:21] 10Analytics: How I get data visitors from Ukraine ru, en wiki? - https://phabricator.wikimedia.org/T273924 (10Base) @Aklapper, you are missing the process described here: #Product-Analytics [20:15:52] 10Analytics: How I get data visitors from Ukraine ru, en wiki? - https://phabricator.wikimedia.org/T273924 (10ua_user) any xml or json or any format, thuse have count page views visitors from Ukraine? 
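[Editor's note: a minimal sketch of the backfill ottomata walks through above at 18:40-18:41. The topic name, the kafkacat -Q offset lookup, the PrefUpdate grep, and the .raw_event extraction come from the log itself; the broker host, partition, timestamp, offset, file names, and the eventgate endpoint URL are illustrative assumptions.]

```bash
# Hedged reconstruction of the 18:40 backfill; values marked below are assumptions.
topic='eventgate-analytics-external.error.validation'
broker='kafka-jumbo1001.eqiad.wmnet:9092'        # assumed broker host

# 1. Query mode: find the partition offset at the deploy time (ms since epoch).
kafkacat -b "$broker" -Q -t "${topic}:0:1612455900000"

# 2. Consume from that offset, keep PrefUpdate errors, pull out the raw event.
kafkacat -C -b "$broker" -t "$topic" -p 0 -o 12345678 -e \
  | grep PrefUpdate \
  | jq -r '.raw_event' > prefupdate_raw.ndjson

# 3. After unescaping and fixing the validation errors (unquoting the userId
#    and isDefault values), re-POST each event; the endpoint URL is assumed.
while read -r event; do
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$event" "${EVENTGATE_URL}/v1/events"
done < prefupdate_fixed.ndjson
```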
[20:18:53] (03PS1) 10Ebernhardson: refinery-drop-hive-partitions: Ensure verbose logging goes somewhere [analytics/refinery] - 10https://gerrit.wikimedia.org/r/661799 [20:19:25] 10Analytics: How I get data visitors from Ukraine ru, en wiki?(top visitors 100 by count) - https://phabricator.wikimedia.org/T273924 (10ua_user) [20:21:24] 10Analytics: How I get data visitors from Ukraine ru, en wiki?(top 100 pages by visitors from Ukranian) - https://phabricator.wikimedia.org/T273924 (10ua_user) [20:33:29] 10Analytics: How I get data visitors from Ukraine ru, en wiki?(top 100 pages by visitors from Ukranian) - https://phabricator.wikimedia.org/T273924 (10ua_user) @Reedy tnx for links, look nice , but pages have total views all pages by country but i need top 100 title pages for from Ukraine visitors in EN and RU [20:40:22] 10Analytics, 10observability, 10User-fgiunchedi: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (10CDanis) I think the change pushed today in the puppet private repo (hash b1b32d4ab) broke the [[ https://wikitech.wikimedia.org/wiki/Wikitech-static#Meta-monitoring | meta-... [20:40:53] 10Analytics: How I get data visitors from Ukraine ru, en wiki?(top 100 pages by visitors from Ukranian) - https://phabricator.wikimedia.org/T273924 (10Reedy) >>! In T273924#6805110, @Base wrote: > @Aklapper, you are missing the process described at #Product-Analytics profile page. But that wasn't seemingly foll... [21:03:04] 10Analytics: How I get data visitors from Ukraine ru, en wiki?(top 100 pages by visitors from Ukranian) - https://phabricator.wikimedia.org/T273924 (10Base) It wasn't, basically I have directed the user here without spending the 15 minutes to dig through my email to lookup the process :) I hoped Andre would help... [21:06:39] 10Analytics, 10Product-Analytics: How I get data visitors from Ukraine ru, en wiki?(top 100 pages by visitors from Ukranian) - https://phabricator.wikimedia.org/T273924 (10Base) [21:09:53] 10Analytics, 10Product-Analytics: How I get data visitors from Ukraine ru, en wiki?(top 100 pages by visitors from Ukranian) - https://phabricator.wikimedia.org/T273924 (10Base) In other words, the user wants to obtain the list of most popular articles of Russian and English Wikipedias in terms of page views f... [21:11:20] 10Analytics, 10Product-Analytics: How I get data visitors from Ukraine ru, en wiki?(top 100 pages by visitors from Ukranian) - https://phabricator.wikimedia.org/T273924 (10Base) 05Invalid→03Open [21:22:00] razzi, ottomata forgot to mention https://phabricator.wikimedia.org/T273850 [21:22:29] oh ya saw that good move [21:23:39] ack :) [21:24:00] (just wanted to make sure that if people notice the slowdown we know why) [21:27:46] 10Analytics, 10GrowthExperiments, 10Growth-Team (Current Sprint), 10MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), 10Patch-For-Review: eventgate_validation_error for NewcomerTask, HomepageTask, and HomepageVisit schemas - https://phabricator.wikimedia.org/T273700 (10Ottomata) > or if they can ssh tunnel to t... [21:28:13] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) Added some docs here: https://wikitech.wikimedia.org/wiki/Event... 
[21:28:31] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10EventStreams, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) Yeehaw thank you Luca! [21:46:21] 10Analytics, 10Product-Analytics: Provide a list of 100 most popular articles of Russian and English Wikipedias in terms of page views from Ukraine - https://phabricator.wikimedia.org/T273924 (10Aklapper)
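[Editor's note: closing out the T273924 thread above — per-article page view counts broken down by country are not published publicly, which is why the task was routed to Product-Analytics. For someone with access to the analytics cluster, a query along these lines could produce the requested top-100 list. This is a sketch only: the wmf.pageview_hourly field names are assumed from that table's documented schema, and any results would need privacy review before sharing.]

```bash
# Hedged sketch: top 100 enwiki articles by views from Ukraine for one month,
# run from an analytics client host (field names assume wmf.pageview_hourly).
sudo -u analytics kerberos-run-command analytics hive -e "
  SELECT page_title, SUM(view_count) AS views
  FROM wmf.pageview_hourly
  WHERE year = 2021 AND month = 1
    AND project = 'en.wikipedia'
    AND country_code = 'UA'
    AND agent_type = 'user'
  GROUP BY page_title
  ORDER BY views DESC
  LIMIT 100;
"
```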