[13:16:54] pfischer: hmm, regarding airflow variables those should live in the postgres db if configured from the UI. Having some sort of central bit is probably fine, although you probably need two separate ones (-main and -jumbo)
[13:31:26] ebernhardson: Oh, so we would run the DAG twice, once for each target cluster? Aren’t -main topics mirrored into jumbo?
[13:32:31] pfischer: events should be found in both, but the mjolnir bits that interact with mjolnir-kafka-msearch-daemon are on -jumbo
[13:33:16] (alternatively, the infrastructure that caused that to be written no longer exists, so in theory mjolnir could query search direct from hadoop now, but that adds other problems)
[13:42:29] ebernhardson: I am not sure I fully understand. Do you have a minute? https://meet.google.com/eww-pzrm-qad
[13:42:40] pfischer: sure
[13:54:49] ebernhardson: T374729 did you mean this one?
[13:54:50] T374729: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729
[13:55:19] pfischer: oh indeed! did we already do that and i didn't notice
[13:56:15] but that probably only works in k8s :(
[13:56:48] Argh, so we’d have to resolve it in AirFlow first and pass it down to spark…
[14:00:14] yea something like that, possible but roundabout
[15:27:41] pfischer ebernhardson for the alert at https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DEventgateValidationErrors are we okay to dismiss the alert?
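[Editor's note: the "resolve it in Airflow first and pass it down to spark" workaround discussed above could be sketched roughly as below; this is a minimal illustration assuming plain A-record resolution and a hypothetical function name, not the actual deployment code.]

```python
import socket

def resolve_bootstrap_servers(hostname: str, port: int = 9092) -> str:
    """Resolve a Kafka service DNS name to a comma-separated
    bootstrap.servers string suitable for handing to a Spark job.

    Hypothetical helper: the real service name would be something like
    kafka-main-eqiad.external-services.svc.cluster.local, which only
    resolves inside k8s, hence resolving on the Airflow side first.
    """
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    addrs = sorted({info[4][0] for info in infos})
    return ",".join(f"{ip}:{port}" for ip in addrs)

# An Airflow task could call this at run time and pass the result to
# spark-submit, e.g. as a --conf or application argument.
```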
i do see the schema validation error rate for stream = mediawiki.cirrussearch-request seems to be more elevated starting from that spike
[15:27:43] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics&var-stream=mediawiki.cirrussearch-request&var-kafka_broker=$__all&var-kafka_producer_type=$__all&var-dc=000000026&from=now-7d&to=now&timezone=utc&var-site=$__all
[15:27:54] hmm
[15:28:48] but was thinking maybe there's some other activity going on that makes this temporary (i sorta vaguely recall seeing a message somewhere, but am not sure where - my quick searches in the usual places aren't helping, except looking for stuff from earlier in the month and farther back, which isn't really what i'm thinking of, i think that was just a few days ago)
[15:29:13] dr0ptp4kt: i will have to find the actual validation errors to see what's happening
[15:30:19] looks like logstash has them, '' should NOT have additional properties. No mention of which property :P
[15:30:22] ok ebernhardson - you got this or you want me to dig and re-dispatch to someone else? i'm working through ops week stuff and can get to it (just let me know if i should hustle as high prio)
[15:30:40] oh my :)
[15:31:02] also odd that the thing that shouldn't have additional properties is the empty string, seems like something is supposed to be there
[15:31:46] dr0ptp4kt: i can look into it, we generally expect these to work so it would be nice for them to all go through
[15:31:50] looking for "should not have additional properties" in phab seems like an assortment of random
[15:32:20] thanks ebernhardson ! i'll capture our chat here and reply on the alert email with you on the To: line for your Gmail-based task prio
[15:33:08] yea, from phab history it seems that empty string is intentional, or at least expected.
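[Editor's note: the error discussed above comes from JSONSchema's `additionalProperties: false` behavior — any key not declared in the schema's `properties` is rejected, and the reported path (here the empty string `''`) is the location of the offending object, i.e. the event root. A toy sketch of that check in plain Python, with a made-up schema fragment and event, not the real cirrussearch-request schema:]

```python
# Hypothetical set of keys declared in a schema's `properties`:
schema_properties = {"meta", "http", "params"}

def extra_properties(event: dict) -> set:
    """Return event keys that the schema does not declare; with
    additionalProperties: false, any such key fails validation."""
    return set(event) - schema_properties

event = {
    "meta": {"stream": "mediawiki.cirrussearch-request"},
    "http": {"method": "GET"},
    # Injected by logging context rather than the producer, so
    # absent from the schema (illustrative, simplified from the
    # nested context.job_type case in the log):
    "job_type": "newcomerTasksCacheRefreshJob",
}
print(extra_properties(event))
```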
odd but whatever
[15:34:41] sigh, gmail is intermittently flapping for me
[15:34:59] somewhat tangentially...i need a nice way to do yaml -> json, because our jsonschemas are in yaml and jsonschema validators want, well, json
[15:35:10] i guess i can roundtrip in python?
[15:35:38] in order for me to conceive of yaml correctly, i mind-parse into python data structures, then mind-parse into json :P
[15:35:57] lol :)
[15:36:44] hmm: Property 'context.job_type' has not been defined and the schema does not allow additional properties.
[15:37:05] curious that it would only exist sometimes? will have to poke around
[15:37:28] it's set to newcomerTasksCacheRefreshJob, not even clear that belongs in our event
[15:42:37] ohh...it's being auto-injected via `LoggerFactory::getContext()->addScoped(...)`
[15:42:48] but we can't have arbitrary code just adding properties to schema'd events
[15:44:53] oh, in Extension:EventBus you mean?
[15:45:09] via JobExecutor.php?
[15:45:13] this happens in core, via the JobRunner. So basically any event fired from a job probably has this issue
[15:45:20] (or some events follow different paths? I forget)
[15:47:14] maybe it's just us doing things an old way, these events have been firing through the logging system since before EventBus/EventGate existed. Are events supposed to go through LoggerFactory still?
[15:51:03] that i'm not sure about. ottomata you happen to know?
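[Editor's note: the yaml -> json roundtrip mentioned above really is just a few lines in Python, assuming PyYAML is installed (it is not in the stdlib); a minimal sketch with a hypothetical schema fragment:]

```python
import json
import yaml  # PyYAML; pip install pyyaml

def yaml_to_json(yaml_text: str) -> str:
    """Round-trip YAML through Python data structures into JSON,
    exactly the mind-parse described in the log."""
    return json.dumps(yaml.safe_load(yaml_text), indent=2)

# Illustrative schema fragment, not a real WMF schema:
schema_yaml = """
title: example/schema
properties:
  meta:
    type: object
additionalProperties: false
"""
print(yaml_to_json(schema_yaml))
```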
i'll go file a ticket as i suspect someone will need to update an (as-yet-to-be-determined) piece of extension code, assuming it isn't a bot dirtying things up
[15:51:44] poking around, i suspect we might need to switch this from shipping events via LoggerFactory to using the EventBusFactory
[15:59:20] filed: T399965
[15:59:21] T399965: Error for mediawiki.cirrussearch-request: '' should NOT have additional properties - https://phabricator.wikimedia.org/T399965
[17:35:48] hmm, setting the limit at 50 distinct page loads with autocomplete ends up filtering out ~11% of daily searches.
[17:38:28] putting it at 100 would be about 6.5%, 300 would filter 2%. I wonder what's appropriate
[17:40:03] I'm actually a bit surprised, but i wonder if this is the wrong way of looking at it.. If we take all page views that invoke autocomplete and sort them by the number of page views with autocomplete per day, the 50th percentile performs autocomplete on 6 distinct pages
[17:40:49] number of page views with autocomplete *from their ip address* per day
[17:48:06] ebernhardson: Interesting question... if you sample "one autocompletion" per IP (assuming you can do that), then the outliers aren't over-represented.
[17:48:08] High, but humanly plausible numbers like 300—that's every 3 minutes for 15 hours, or every minute for 5 hours, or 5 a minute for an hour.. all possible, but not normal human reader behavior—could be editors, but could also be bots, I guess. Filtering inhumanly high values—1000? 1500? 86400?—seems worth doing, though.
[17:48:39] I'm curious what the highest number in your sample is, BTW?
[17:48:56] hmm, sec i can pull it out
[17:49:44] 99th percentile is 865 queries, max is 5921. dataset is a single day with ~3.2M page views that invoked autocomplete
[17:51:00] Those are rookie numbers!
[17:51:05] lol
[17:51:08] So close to 6K!
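[Editor's note: the threshold analysis above can be sketched with synthetic data; the distribution below is made up, and the sketch assumes the limit drops all events from IPs above it (one reading of "filtering out ~11% of daily searches") — the real numbers come from the autocomplete event logs.]

```python
import random

random.seed(0)
# Synthetic per-IP counts of distinct page views with autocomplete:
# mostly small values with a long bot-like tail (made-up distribution).
counts = [min(int(random.paretovariate(1.2)), 6000) for _ in range(100_000)]

def fraction_filtered(counts, cap):
    """Fraction of total autocomplete page views lost if every IP with
    more than `cap` distinct page views is excluded entirely."""
    total = sum(counts)
    kept = sum(c for c in counts if c <= cap)
    return 1 - kept / total

for cap in (50, 100, 300):
    print(cap, round(fraction_filtered(counts, cap), 3))
```

Raising the cap always filters less, so comparing a few candidate caps this way shows the trade-off curve directly.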
[17:52:37] these are the overall quantiles, i guess plausible: https://phabricator.wikimedia.org/P79412
[17:53:07] Just to be 100% clear on my end, you aren't counting individual letters typed (at least ones that trigger events)...all letters typed on one page load count as 1, right? If so, 5921 is crazy! 100 seems like a fine limit. 50 is okay, if we are willing to count everyone above that as an über-editor or bot
[17:54:03] Trey314159: right, the count is technically `F.size(F.collect_set('event.pageViewId').over(w_ip))`. Since the source data is autocomplete events, in English that's the number of distinct page views that invoked autocomplete from this ip address
[17:55:19] 100 seems reasonable...i suppose to be more sure i should probably investigate some of those sessions...but i could easily spend half a day doing that to little value :p
[17:55:22] Holy Hockey Stick, Batman! https://usercontent.irccloud-cdn.com/file/CcaOLOb7/hockey%20stick.png
[17:55:38] lol, i was thinking i should be less lazy and graph it :P
[17:56:20] Being confident that you have a reasonable understanding of user behavior isn't of little value, IMHO. Dig into it for an hour and see what you see.
[17:57:34] i suppose
[18:09:46] hmm..there are 30k (~1%) of page views that logged the same unique pageViewId token from multiple ip addresses
[18:10:40] which should be pretty hard to do by accident, it's a random string plus the current timestamp in base36
[18:11:15] shows how terrible IPs are for such purposes :P
[18:27:53] \o
[18:29:48] * ebernhardson finds it mildly amusing that the useragent `HeadlessChrome` does not trigger the useragent.is_bot flag
[18:29:56] well, the browser_family
[20:50:33] ryankemper you have anything for pairing? I was out the 1st half of the day and I've just been grinding on the opensearch helm chart stuff
[21:14:19] inflatador: just iterating on spicerack patch, out w jango for a bit so im fine cancelling
[21:29:20] ryankemper np, have a gook weekend!
[21:29:24] errr...good
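[Editor's note: the pageViewId token described above — a random string plus the current timestamp in base36, similar in spirit to MediaWiki's client-side tokens, though the exact scheme here is a guess — is easy to sketch, which also shows why the same token appearing from multiple IPs is very unlikely to happen by accident:]

```python
import random
import time

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n: int) -> str:
    """Encode a non-negative integer in base36."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(BASE36[r])
    return "".join(reversed(digits))

def page_view_token() -> str:
    """Random base36 string plus the current millisecond timestamp in
    base36 (approximation of the token scheme described in the log)."""
    rand_part = "".join(random.choices(BASE36, k=10))
    return rand_part + to_base36(int(time.time() * 1000))
```

With 36^10 possibilities in the random part alone, an accidental collision across IPs is vanishingly rare, so duplicates point at replayed or copied requests rather than chance.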