[21:47:04] isaacj: what's up with Shona language and 20% of the traffic coming from china?
[21:47:06] * leila thinks
[21:47:29] yeah - i haven't looked into it but my guess is scrapers getting identified as users
[21:49:38] leila: i keep on bumping up against this issue in small ways but it hasn't gotten to the point where it affects an analysis substantially so i haven't acted on it, but at some point, i might add some observations about where this seems to matter here: https://phabricator.wikimedia.org/T138207
[21:51:22] related: a substantial proportion of user traffic with referers that we characterize as "other - external" (which should mean things like Reddit etc.) is actually search engines that we haven't listed as such or scrapers that aren't being caught because they don't label themselves appropriately
[21:54:01] isaacj: please do re T138207. nuria is working on some version of that task and I'm sure she would be happy to receive observations.
[21:54:02] T138207: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207
[21:55:05] isaacj: this latter point you've observed is important to catch. We should keep it for the next Analytics hangtime or just create a task for it to capture them as you learn more about them.
[21:55:36] isaacj: don't feel you have to fix everything though. ;) The branching can continue forever.
[21:56:55] leila: sounds good. and yeah, i certainly don't view it as an easily solvable problem because it's very much a game of whack-a-mole, but might be worth updating our code to improve things at least for a while
[22:19:21] isaacj: ah bots, how i wish i had time to finish that, it is really not that far away, but yeah, have in mind that for some times/some sites unidentified bot traffic is as high 10%
[22:19:32] isaacj: * as high as 10%
[22:31:27] nuria: oof yeah. and the fact that a lot of this unidentified bot traffic comes from countries (T195880#4429156) without large user populations to counteract the noise definitely makes it a really important caveat for certain analyses
[22:31:28] T195880: % of "none" referers seems too high - https://phabricator.wikimedia.org/T195880
[22:33:12] isaacj: ya, agreed totally. But this should not affect your surveys (or edits) but rather pageviews or EL data that is pageview related like page previews or citation usage
[22:33:30] isaacj: are you seeing an effect on your surveys too?
[22:37:57] nuria: yeah, it's not that the bots are messing with our surveys, but we're at a stage where we're trying to decide what languages have enough readers where we could reasonably launch a survey and what countries the survey would reach, so we've been looking at country-of-origin of page views to different language editions and Shona stood out because 22% of page views were coming from China despite shona being associated with Zimbabwe
[22:38:42] of course, it could be real traffic too: https://en.wikipedia.org/wiki/China%E2%80%93Zimbabwe_relations
[22:39:13] isaacj: i see , let me see shona wikipedia is showiki?
[22:39:48] nuria: sn
[22:41:34] isaacj: this one right? https://sn.wikipedia.org/wiki/Peji_Rekutanga
[22:42:03] yep, and i was just looking at Erik's stats on https://stats.wikimedia.org/wikimedia/animations/wivivi/wivivi.html
[22:47:07] isaacj: https://bit.ly/2CeqIPd
[22:47:27] isaacj: most definitely bots
[22:50:47] isaacj: bots on desktop site with no referrer big percentage of which have an IE UA
[22:51:17] nuria: hah, yeah, wow... does seem that some amount of the traffic from China is real in the sense that there's a baseline small amount. the correlation between France and China is odd too
[22:54:02] isaacj: normally when you look at that traffic in detail pattern emerges
[22:54:24] nuria: i also always forget how useful turnilo is for this
[22:55:19] isaacj: for bots ya, works great, cause you can see the same spike in different lights: per country, per UA
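
The slicing described above (the same traffic viewed per country and per UA, narrowed to desktop requests with no referrer and an IE user agent) could be sketched roughly as follows. This is only an illustration: the record layout, the field names (`country`, `access`, `referer`, `ua_family`, `views`), the helper names and the numbers are all made up for the example and are not the actual pageview schema or real Shona Wikipedia data.

```python
from collections import Counter

def share_by(records, key):
    """Return {value: fraction of total views} for one dimension (e.g. country)."""
    totals = Counter()
    for r in records:
        totals[r[key]] += r["views"]
    grand = sum(totals.values())
    return {k: v / grand for k, v in totals.items()} if grand else {}

def suspicious_slice(records):
    """Keep records matching the pattern from the chat:
    desktop site, no referrer, IE user agent. A heuristic, not a bot detector."""
    return [
        r for r in records
        if r["access"] == "desktop"
        and r["referer"] == "none"
        and r["ua_family"] == "IE"
    ]

# Toy records with invented numbers, purely for illustration.
records = [
    {"country": "ZW", "access": "mobile web", "referer": "search engine",
     "ua_family": "Chrome Mobile", "views": 60},
    {"country": "CN", "access": "desktop", "referer": "none",
     "ua_family": "IE", "views": 20},
    {"country": "FR", "access": "desktop", "referer": "none",
     "ua_family": "IE", "views": 15},
    {"country": "CN", "access": "desktop", "referer": "internal",
     "ua_family": "Firefox", "views": 5},
]

by_country = share_by(records, "country")          # overall country split
flagged = suspicious_slice(records)                # the suspicious pattern only
flagged_by_country = share_by(flagged, "country")  # same slice, per country
flagged_by_ua = share_by(flagged, "ua_family")     # same slice, per UA
```

Comparing `by_country` with `flagged_by_country` shows how much of a country's share sits inside the suspicious slice. Traffic left outside that slice would be a candidate for the small baseline of real readers mentioned in the chat, though a filter this crude cannot confirm that on its own.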