[07:01:03] 06Analytics-Kanban, 10DBA, 06Operations, 10ops-eqiad: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3302895 (10Marostegui) 05Open>03Resolved This is now back to Optimal ``` root@db1046:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id:... [09:50:25] 10Analytics, 06Operations, 15User-Elukey: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886#3303179 (10elukey) 05Open>03Resolved a:03elukey Didn't re-occur and after a chat with Andrew we didn't find any good root cause. Inclined to close this issue as... [10:38:17] (03PS8) 10Joal: Add unique devices project-wide oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/352181 (https://phabricator.wikimedia.org/T143928) [10:39:59] all the kafka analytics brokers restarted [10:56:02] great elukey :) [10:56:20] (03PS9) 10Joal: Add unique devices project-wide oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/352181 (https://phabricator.wikimedia.org/T143928) [11:00:56] * elukey lunch! [11:26:47] (03PS10) 10Joal: Add unique devices project-wide oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/352181 (https://phabricator.wikimedia.org/T143928) [11:41:12] 06Analytics-Kanban: Update druid unique Devices Dataset to only contain hosts having more than 1000 uniques - https://phabricator.wikimedia.org/T164183#3303406 (10JAllemandou) Yes, it has ! This task can be closed. [12:00:32] fdans: I'm in the cave [12:20:03] taking a b [12:20:06] reak a-team [12:20:07] later [12:24:17] 10Analytics: Old deleted pages have empty fields in Analytics Cluster edit data - https://phabricator.wikimedia.org/T165201#3303670 (10Milimetric) you said it perfectly, @mforns. We think we got everything that was worth getting. If we're wrong on that, we definitely want to know and we'd look at the data agai... [12:25:20] 10Analytics: Improve Oozie error emails for testing - https://phabricator.wikimedia.org/T161619#3303672 (10Milimetric) that sounds like a fine way forward, you can claim this task and mark it done, and mention it at standup. [12:26:50] 10Analytics: Monitor if/when mediawiki history reconstruction partitions and imports fall out of sync - https://phabricator.wikimedia.org/T166405#3303673 (10Milimetric) 05declined>03Resolved Well, this isn't declined though, it was an issue and it was/is being fixed. [12:57:11] hi team :] [12:59:05] hi joal, elukey, :] there's nothing in the ready to deploy column, but there are a couple things in CR for the cluster, do you think it makes sense to shift cluster deployment to tomorrow? [13:04:25] mforns: usually we do deployments on Thursdays no? (but I'll be off on Friday) [13:04:49] elukey, I thought it was on wednesdays... O.o [13:09:38] I may not remember correctly, but we can definitely wait until tomorrow [13:09:54] only caveat is that if you want ops support you'll need to wait for Andrew [13:10:06] sorry I meant on Friday [13:10:20] (I'll be off) [13:29:31] (03CR) 10Mforns: [WIP] UDF to tag requests (035 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/353287 (https://phabricator.wikimedia.org/T164021) (owner: 10Nuria) [13:30:52] 10Analytics-Tech-community-metrics, 06Developer-Relations (Apr-Jun 2017): Automatically sync mediawiki-identities/wikimedia-affiliations.json DB dump file with the data available on wikimedia.biterg.io - https://phabricator.wikimedia.org/T157898#3303883 (10Albertinisg) >>! 
In T157898#3299959, @Aklapper wrote:... [13:33:58] 06Analytics-Kanban: Evaluate swiv and see whether outstanding pivot bugs are fixed - https://phabricator.wikimedia.org/T166320#3303915 (10mforns) a:03mforns [13:35:37] mforns: https://gerrit.wikimedia.org/r/#/c/356383 [13:35:53] elukey, looking :] [13:35:56] I'd say that the whitelist will be merged as second step, so we'll have two separate commits [13:36:35] elukey, it went back to single file [13:37:16] mforns: yeah as we discussed yesterday in standup [13:37:21] aha [13:37:25] k, reviewing [13:37:25] I packed everything to be deployed in puppet [13:43:31] mforns: I am fixing some pep8 issues in the meantime, will upload a newer code review soon [13:43:38] cool np [13:50:45] mforns: done! [13:50:51] ok! [13:57:52] ottomata: hiii! There is also https://gerrit.wikimedia.org/r/#/c/354449/ to review about role/profiles conversion (saw your message on _sec :P) [14:00:08] I am checking https://gerrit.wikimedia.org/r/#/c/355796/3/modules/confluent/manifests/kafka/broker.pp [14:01:15] k reading! [14:06:56] the kafka code review looks good to me! [14:07:14] just as curiosity - when woukd the PLAINTEXT://:9092 be used? [14:07:42] if we merged this now, it would change on current kafka clusters [14:07:54] so, the config would change, but it would be a noop, as port=9092 is the same [14:08:20] also, used in labs [14:08:51] elukey: we got ops sync up in a bit, want to talk about profiles and kafka and stuff then? [14:08:52] ah so it refreshes the config with a more up to date one [14:08:57] do we have other things? [14:09:01] sure! [14:10:00] re: other things - I have squeezed the EL purge script in one py file in puppet (https://gerrit.wikimedia.org/r/#/c/356383), should be ready for real testing after me and Marcel do some vetting on local testing environment [14:10:23] new version of zookeeper (needs testing in labs but we have trusty in there) [14:10:33] nothing more IIRc [14:12:05] trusty? [14:15:18] 10Analytics: Reinstate a subset of reports removed from the reportcard until WikiStats 2.0 is back - https://phabricator.wikimedia.org/T166679#3304049 (10DarTar) [14:17:22] deployment-zookeeper01.eqiad.wmflabs [14:17:48] do we have another instance in labs? [14:18:44] my idea was to spin up deployment-zookeeper02.eqiad.wmflabs with Jessie and migrate things over [14:19:13] (that should be only modifying https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep and running puppet on kafka nodes) [14:20:22] elukey: +1. i've got a zk in analytics [14:20:29] its easy to just spin them up though and test [14:23:45] ottomata: from stats on the zk node I can see deployment-kafka01.deployment-prep.eqiad.wmflabs. etc.. [14:23:52] is it ok to change zk for those? [14:27:21] elukey: don't flully remember what the setup there is, but all the stuff is just for testing. so, you can probably wipe whateer you need. [14:27:29] elukey: also, you could just add a new zk to the existing zk cluster there [14:27:35] and then spin down the other one? maybe? [14:27:37] actually i've never done that [14:27:44] but elukey ya, you can change whatever you want :) [14:28:02] for kafka, if you change zk clusters, you'd probably have to wipe the kafka data on disk too [14:28:20] arg [14:31:10] not hard though [14:31:42] elukey: bc? [14:33:46] ottomata: I am waiting, you already in? [14:33:59] joseph and i are here [14:34:06] what the hell [14:34:15] wrong google account elukey :) [14:34:54] joal: nono I had the wikimedia one listeed [14:53:17] mforns: only 16 comments! 
You are definitely not pedantic :P [15:01:07] ping elukey mforns [15:06:41] 06Analytics-Kanban, 07Easy, 13Patch-For-Review: Don't accept data from automated bots in Event Logging - https://phabricator.wikimedia.org/T67508#3304254 (10Tgr) 05Resolved>03Open >>! In T67508#3298492, @Nuria wrote: > @tgr: All calls go through varnish, there are no direct posts from php anymore (it is... [15:21:23] elukey, ping me if/whenever you want to pair on EL testing! [15:21:27] 10Analytics: Update pivot with swiv clone - https://phabricator.wikimedia.org/T166689#3304302 (10Nuria) [15:22:00] elukey, xD [15:22:11] fdans: so I'm just working on my version of "build a query from a metric object and get data back from AQS" [15:23:14] I don't think pairing would be too super useful at this stage, but let me know if you want to hang out and work together [15:23:52] milimetric: maybe 5min on the cave to have a look ans see what you're thinking? [15:24:03] sure [15:27:39] 10Analytics: Old deleted pages have empty fields in Analytics Cluster edit data - https://phabricator.wikimedia.org/T165201#3304337 (10Nuria) Sounds good, let's update docs and close this task when done. Renaming and moving to kanban. [15:28:08] 06Analytics-Kanban: Document that old deleted pages have empty fields in Analytics Cluster edit data - https://phabricator.wikimedia.org/T165201#3304340 (10Nuria) [15:29:05] mforns: let's do it tomorrow, I have one hour of meetings and the kafka restarts :( [15:29:08] ok ? [15:30:02] (03CR) 10Joal: "Some comments, but most of it is ok :)" (036 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/356125 (owner: 10Nuria) [15:30:27] nuria_: one minute on documentation for uniques? [15:38:43] (03CR) 10Joal: [C: 031] "Some minor nits in comments, but globally ok for me" (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/347653 (https://phabricator.wikimedia.org/T159727) (owner: 10Joal) [15:39:54] urandom, elukey: need to leave for some time, will probably be late at cassandra standup - apologies [15:46:41] joal: two demerits [16:16:22] 06Analytics-Kanban: Improve Oozie error emails for testing - https://phabricator.wikimedia.org/T161619#3304540 (10JAllemandou) a:03JAllemandou [16:24:57] elukey, sure, no worries [16:29:17] hey again nuria_ - around? [16:31:52] a-team: the traffic team is moving maps traffic to upload now [16:33:15] k elukey - will check camus [16:33:19] thanks for the warning [16:33:25] eqi analytics1003 [16:33:27] oops :) [16:38:24] (03CR) 10Joal: "Swo small nits - Awesome idea !" 
(032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/356307 (owner: 10Nuria) [16:49:52] (03CR) 10Joal: "Comments inline" (035 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/355601 (https://phabricator.wikimedia.org/T162034) (owner: 10Mforns) [16:51:30] (03PS8) 10Nuria: Provide RedirectToPageview function and UDF [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/356125 (https://phabricator.wikimedia.org/T143928) [16:53:26] (03PS9) 10Nuria: Refactor PageviewDefinition to add RedirectToPageviewUDF [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/356125 (https://phabricator.wikimedia.org/T143928) [16:54:12] (03CR) 10Joal: "Same here, 2 samll nits but ok for me (a non-python guy)" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/353309 (https://phabricator.wikimedia.org/T164497) (owner: 10Mforns) [16:55:05] (03CR) 10Nuria: Refactor PageviewDefinition to add RedirectToPageviewUDF (036 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/356125 (https://phabricator.wikimedia.org/T143928) (owner: 10Nuria) [16:55:24] joal: corrected per your comments: https://gerrit.wikimedia.org/r/#/c/356125/9 [16:55:48] joal: let me know if there were to be any issues when testing with uniques code [16:55:58] nuria_: will try now [16:57:37] joal: will also do your suggested changes to memoization cr, i think that is gotta speed the refine by some amount given that we will be executing a lot less code [16:57:52] nuria_: possible, but not sure ;) [16:58:09] nuria_: I'd love that though :) [16:58:43] nuria_: By the way the 'not sure' is about 'some amount', it will definitely speed up things, but how much ... [17:06:59] 06Analytics-Kanban, 07Easy, 13Patch-For-Review: Don't accept data from automated bots in Event Logging - https://phabricator.wikimedia.org/T67508#3304868 (10Tgr) Note that even if `EventLogging::logEvent` would forward the user agent (which it currently doesn't), filtering on that would still make no sense f... [17:10:40] 06Analytics-Kanban, 07Easy, 13Patch-For-Review: Don't accept data from automated bots in Event Logging - https://phabricator.wikimedia.org/T67508#3304898 (10Tgr) ``` mysql:research@analytics-store.eqiad.wmnet [log]> select timestamp, count(*) from MediaWikiPingback_15781718 group by substr(timestamp, 1, 8) o... [17:13:44] tgr: are you saying that all events from MW are not making it to mysql? [17:14:04] haha, phrased better: no events from MW make it to mysql? [17:14:59] ottomata: yeah, they are sent with user agent MediaWiki/ and the filter includes something like ^MediaWiki.*$ [17:15:29] tgr, ok, so i understand for sure before I merge your revert [17:16:10] server side events from mw are not being inserted into mysql because their user agent's are filtered by the is_not_bot() filter function [17:16:12] correct? [17:16:13] 10Analytics, 10Analytics-EventLogging, 06Community-Tech: Remove EventLogging for cookie blocks - https://phabricator.wikimedia.org/T166247#3304930 (10DannyH) p:05Triage>03Normal [17:17:53] hm, that's not quite true, PageDeletion e.g. 
seems to work [17:17:58] let me test some more [17:20:13] 10Analytics, 10Analytics-EventLogging, 06Community-Tech, 06Editing-Analysis, and 2 others: Record an EventLogging event every time a new mainspace page is created - https://phabricator.wikimedia.org/T150369#3304946 (10DannyH) [17:21:06] k [17:22:19] 10Analytics, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3304971 (10Pchelolo) @RobH The 3 day period have passed, could you please merge this as we need to gather some data in preparation to a meeting... [17:25:31] ottomata: I'm a little lost. Logging for the MediaWikiPingback and CommandLineInvocation schemas break on 5/24 when the patch was merged, PageMove/PageDeletion don't [17:26:35] 10Analytics, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3305015 (10RobH) 05Open>03Resolved Merged and puppet is running on affected servers right now. [17:26:44] mforns: I updated the code review with your suggestions, but I'd need to make more tests with different batch sizes tomorrow [17:26:46] the obvious difference between those is that Page* is called when the web server is serving a user request, pingback/CLI are called from PHP CLI mode or some such [17:26:54] mforns: will ping you tomorrow! [17:27:03] elukey, OK [17:27:05] :] [17:27:25] so it seems as if the EL beacon would somehow inherit the user agent from the request that caused the PHP code to run [17:27:54] and indeed for Page* there are human user agents in the table and for pingback not [17:28:47] but I have no clue how that user agent forwarding could happen, EL in PHP uses the HTTP::post method which sets its own useragent [17:29:52] ottomata: anyhow that still means the change broke EL logging, but to a much smaller extent than I thought (only events logged from CLI mode) [17:30:41] nuria: checked, got the exact same result from your patch and mine (count over 1 hour of webrequest) [17:31:18] nuria_: With your agreement, I'll merge [17:31:45] tgr, interesting [17:31:56] do you think we could easily make a change to the filter to fix your problem? [17:32:00] or should we do a full revert? [17:36:26] * elukey off! [17:36:33] Bye elukey [17:40:48] ottomata: I don't think there is an easy fix. We could make the EL PHP code (plus whatever code CommandLineInvocation is using) use a distinctive useragent, but MWPingback is used to collect data from 3rd party MediaWiki installations which might use old versions of MediaWiki, so that wouldn't help [17:41:45] we could remove 'MediaWiki' from the filter but that might probably result in false positives for schemas which really don't care about bots [17:42:47] also, I'm not sure what useragent the CommandLineInvocation events are using [17:43:26] hmmmm [17:43:38] nuria_: ^^ thoughts? can we remove MediaWiki from is_not_bot filter? [17:43:55] fdans: ^^ too [17:44:37] Joal: sounds good [17:44:44] joal: sounds good [17:44:51] nuria_: merging ! [17:44:52] ottomata: looking [17:45:22] (03CR) 10Joal: [C: 032] "Tested, looks good, merging!" 
(032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/356125 (https://phabricator.wikimedia.org/T143928) (owner: 10Nuria) [17:46:09] tgr: let's try to see what schemas are broken before we revert, we can easily undo changes by removing filter [17:46:13] cc ottomata [17:46:31] tgr has found the schemas that he needs that aren't making it to mysql anymore [17:46:33] so it doesn't require a revert of code just configuration of consumers cc fdans [17:46:45] nuria_: yes, the patch he made is revert of puppet config [17:46:46] that's fine [17:46:46] nuria_: MediaWikiPingBack and CommandLineInvocation [17:46:48] we can revert that [17:47:03] but, since it is just these two schemas, i was wondering if we could just deploy a quick el fix [17:47:04] to the filter function [17:47:17] is_not_bot doesn't catch all bots anyway [17:47:19] tgr: do you look at those frequently in mysql or having access to those in hadoop for 90 days would be fine [17:47:23] I skimmed the active schemas category and nothing else jumped out as being called from the command line [17:47:25] might as well make an exception for Mediawiki bots [17:47:56] we can add the specific schemas to the filtering function [17:48:03] wait a sec [17:48:11] is that what you're suggesting ottomata ? [17:48:26] let's first make sure tgr needs this data on mysql [17:48:45] fdans: no, i was suggesting to not cconsider ^MediaWiki as a bot [17:48:48] I use PingBack but not frequently. I don't know anything about the other schema (yuvipanda probably does), I was just looking through the list [17:49:19] tgr: these changes affect the mysql consumer [17:49:23] tgr: alone [17:49:32] fwiw I still don't understand why the other backend schemas have not been broken, could be some bug in MediaWiki's Http class [17:50:02] tgr: when we moved server side events to post through varnish we tested UA was transfer [17:50:06] tgr: and it was [17:50:20] tgr: this was ori ottomata and myself a long time ago [17:50:25] yeah, but it shouldn't be :) [17:50:52] tgr: that is news to me though [17:51:13] or at least I am unable to find how that happens, it's a request done with MediaWiki's Http class and that should use the UA MediaWiki/1.30.0 or something like that [17:51:22] tgr: but let's concentrate on the schemas that you care about that are notr working properly [17:51:29] (03Merged) 10jenkins-bot: Refactor PageviewDefinition to add RedirectToPageviewUDF [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/356125 (https://phabricator.wikimedia.org/T143928) (owner: 10Nuria) [17:53:30] ottomata, fdans: if it just these two schemas i think deploying a fix to the function might be best [17:53:59] nuria_: so I personally don't mind accessing pingback data via hadoop, there might be people who get locked out that way though (since we normally use hadoop for more sensitive stuff so less people have access) [17:55:22] the schema was created by ori, my guess would be that the platform team and people doing releases (chad, mainly) are using it [17:55:23] tgr: ok then let's try to fix the db end if you think others might want to access mysql [17:55:32] as I said I have no clue about the other schema [17:56:03] looking at the fn for a second... [17:57:16] tgr: could it be this that makes it use the request UA? https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/includes/EventLogging.php#L67 [17:58:03] the code in puppet only prevents the event from proceeding to mysql. 
The determination of whether MediaWiki is a bot is made in the EL backend [17:58:27] duh, yes, I was being stupid [17:59:12] I was thinking about the actual user agent for the requests to the beacon but EL sends the user agent inside the JSON payload [17:59:20] so that's one mistery less [17:59:41] tgr: ok so that makes sense, now we know why the rest of schemas are not broken [18:00:32] tgr, ottomata , fdans : then given that impact is small i vote for "whitelisting" the mediawki bot? [18:01:17] nuria_: could you check what UA the CommandLineInvocation schema uses? [18:01:31] tgr: yes, looking [18:01:32] that seems to happen via some sort of shell script, not PHP [18:02:03] if this is a bot and we are whitelisting a bot it should probably go in the filter function in puppet [18:02:50] (FWIW I have no idea whether that schema is actually needed by anyone, I just went through the list of a schemas and it seemed ike something that would be affected) [18:03:46] tgr: this one is Python-urllib/2.7 [18:03:58] oh ok, so that is a different user agent. hm [18:04:04] yeah i guess whitelisting those schemas is fine [18:04:11] we can put that in the filter plugin function [18:04:29] tgr: but if logging of this schema is done via some command line tool we find out about today it is likely it will break for many other reasons [18:04:29] might want to rename that function then :p [18:05:06] ah names... [18:06:02] ok I got it [18:06:10] ottomata: the commandline invocation data i bet is not used by anyone [18:06:41] nuria_: thanks for the quick help! I'll have to be AFK for an hour or so [18:07:15] I'll pass on CommandLineInvocation, I didn't even know until today that that schema existed [18:07:30] the cloud team are probably the right people to ask about it [18:08:11] tgr: np, I think we will just "fix" mediawiki, not sure about ottomata and fdans but i think we should first make sure that other schema is used by anyone before whitelisting it [18:08:16] tgr: will ask on labs [18:09:26] +1 nuria_ [18:09:40] if we can whitelist 'mediawiki' by doing it by a user agent value rather than a schema name [18:09:42] that would be better too [18:09:47] is that possible? [18:11:55] 06Analytics-Kanban, 07Easy, 13Patch-For-Review: Don't accept data from automated bots in Event Logging - https://phabricator.wikimedia.org/T67508#3305206 (10Nuria) >Note that even if EventLogging::logEvent would forward the user agent (which it currently doesn't) To recap from IRC. Server side events DO forw... [18:13:32] ottomata: mmm let me look at your fancy pants patch again [18:13:59] cc fdans [18:14:53] nuria_, ottomata: that assumes that anything that has mediawiki on it should be whitelisted, is that what we want? [18:15:16] fdans: we can do that for now [18:15:24] fdans: is a lot better than whitelisting schemas [18:15:50] ooook :) [18:16:53] 06Analytics-Kanban: Evaluate swiv and see whether outstanding pivot bugs are fixed - https://phabricator.wikimedia.org/T166320#3305218 (10mforns) He team, I must retract from what I said in Stand-up: The legend bugs have **not** been fixed. Looking at the latest changes in the repo, I couldn't find anything rel... [18:18:10] 10Analytics: Update pivot with swiv clone - https://phabricator.wikimedia.org/T166689#3304286 (10mforns) Discovered that the legend bug was actually **not** fixed (see T166320), so... Do we still want to do this? The only advantage is the autosource schema feature that displays the zero field in pageview_hourly. 
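The "filter plugin function" being discussed is the hook the mysql consumer uses to decide whether an event is inserted. A minimal Python sketch of the schema-based exception floated at this point in the conversation — the hook name and event fields are illustrative assumptions, not the actual eventlogging plugin API:

```
# A minimal sketch of the schema whitelist idea, NOT the real plugin API:
# assume the mysql consumer calls a hook with the decoded event dict and
# inserts the event only if the hook returns True.
BOT_EXEMPT_SCHEMAS = {'MediaWikiPingback'}

def should_insert_event(event):
    ua = event.get('userAgent') or {}
    if not ua.get('is_bot', False):
        return True  # non-bot traffic always reaches MySQL
    # bot-flagged events pass only for whitelisted server-side schemas
    return event.get('schema') in BOT_EXEMPT_SCHEMAS
```

The rest of the log works out why whitelisting by user agent value instead is harder than it sounds: by the time the consumer runs, only the pre-parsed UA map is left.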
[18:18:43] fdans, ottomata: i think adding whitelisting here is pretty easy: https://gerrit.wikimedia.org/r/#/c/350234/14/eventlogging/utils.py [18:19:15] nuria_: that would be ideal yeah [18:19:26] so, just remove MediaWiki [18:19:26] ? [18:19:33] from that regex? [18:20:00] this implies that anything carrying MediaWiki on its UA string shouldn't be considered a bot right nuria_ ? [18:20:45] elukey: so, re. profile hiera stuff [18:20:47] what about statsd? [18:20:47] anytime i need the statsd host in a profile [18:21:01] i need to set it as a class hiera parameter? [18:21:06] and then define it in every role? [18:21:07] oh no [18:21:09] fdans: but since we are logging parsed UAs [18:21:11] i don't need to define it in a role [18:21:14] since it is defined globally [18:21:16] duh duh [18:21:16] sorry [18:21:25] nuria_: but that isn't that function's job right? [18:22:15] like, EL shouldn't know whether or not we are logging stuff in mysql [18:22:27] (sorry if I'm missing something) [18:23:30] cc ottomata [18:23:35] 10Analytics, 06cloud-services-team: Remove logging from labs for schema https://meta.wikimedia.org/wiki/Schema:CommandInvocation - https://phabricator.wikimedia.org/T166712#3305274 (10Nuria) [18:24:22] 10Analytics, 06cloud-services-team: Remove logging from labs for schema https://meta.wikimedia.org/wiki/Schema:CommandInvocation - https://phabricator.wikimedia.org/T166712#3305259 (10Nuria) See: https://phabricator.wikimedia.org/T123444 [18:24:47] fdans, ottomata: ok, the other schema is not used anymore https://phabricator.wikimedia.org/T166712 [18:24:57] sorry fdans missed your point earlier [18:25:19] fdans: where do you think the exclusion fits better? [18:25:30] fdans: true, but maybe we just don't want to consider mediawiki as a 'bot' user agent? [18:25:41] yeah there's my point ottomata [18:26:07] fdans: if we don't want to consider mw as a bot, then we can remove it from that regex [18:26:11] if we are making an exception on mw, but we still consider it a bot, this should be in puppet [18:26:16] exactly [18:26:29] if we are just doing this so that we can make a special exclusion for this mysql thing, then yeah, it goes in the filter plugin function [18:26:31] yeah [18:26:33] agree [18:26:41] 10Analytics: Update pivot with swiv clone - https://phabricator.wikimedia.org/T166689#3305328 (10Nuria) I would say no then. We can scrap this work and focus on setting up superset for PMs to take a look. [18:27:15] fdans: that seems fine and dandy [18:29:17] nuria_: forgot to say, since I was wrong about how MediaWiki passes the user agent, I was probably wrong about the UA for MediaWikiPingBack being MediaWiki* as well [18:29:26] tgr: hahaha [18:29:31] chances are it's 'PHP' or something like that [18:29:34] cc ottomata fdans [18:29:42] let me look [18:31:12] tgr: [18:31:15] https://www.irccloud.com/pastebin/GhwP5Eqw/ [18:31:24] not much going on on this schema [18:31:54] tgr: since 201703 [18:32:05] mmmmm [18:32:23] tgr: maybe that speaks of another error on our end [18:33:33] nuria_: seemed fine to me: https://phabricator.wikimedia.org/T67508#3304898 [18:34:05] nuria_: is that the right revision for that schema? [18:34:29] noooo [18:34:30] duh [18:34:32] https://www.irccloud.com/pastebin/6vCv3QDP/ [18:34:48] :) [18:35:36] tgr, fdans : ya, that makes sense, UA is (since we pre-parse it now) "other" [18:36:29] also last entry is from the 24th, which may be just before we deployed the filter?
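nuria_'s observation that the pre-parsed UA comes out as "other" is easy to reproduce with the ua-parser library EventLogging's UA parsing is built on (0.x module path; the agent string is tgr's guess from earlier in the log, not a confirmed value):

```
# pip install ua-parser   (0.x releases; the module path changed later)
from ua_parser import user_agent_parser

parsed = user_agent_parser.Parse('MediaWiki/1.30.0')
print(parsed['user_agent']['family'])  # 'Other' - no uap-core rule matches
print(parsed['device']['family'])      # 'Other'
```

Once only this map reaches the mysql consumer, the raw string is gone, so nothing downstream can regex-match on 'MediaWiki' anymore — which is exactly the corner the conversation below is trying to get out of.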
[18:36:37] nuria_ ottomata ? [18:36:50] fdans: ya, but that makes sense [18:37:00] fdans: right? [18:37:08] yeah yeah, I had a miniscare for a second [18:37:14] fdans: we would not expect events to have been inserted [18:37:20] thinking it was letting is_bot: true events in [18:38:26] tgr: so, for this schema, issues are twofold: [18:38:47] 1) UA is lost (as when it is preprocessed it will appear as "other") [18:39:03] 2) as of last deploy (and this we need to fix ASAP) events are not being inserted [18:39:24] tgr: 2) is clearly an issue [18:39:58] we could retrieve "lost" events from hadoop right? [18:48:16] fdans: ya, there is no data loss [18:48:30] fdans: let's fix 2) and talk with tgr to see if 1) is also an issue [18:50:03] nuria_: if you have a minute, can you please proof-read / update the 2 portions I added to https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Unique_Devices/Last_access_solution [18:50:29] yeah sooooo since we lose info on mediawiki by the time it reaches the consumer... right now we can only change EL, where we can: [18:50:52] a) remove MediaWiki from the regex, thus considering MW not a bot [18:51:11] fdans: i thought we agreed to remove it at filter level [18:51:25] b) patch the parse function and set a parsed UA Map for mediawiki, and filter out in puppet [18:51:38] this would also fix issue number one you mentioned nuria_ [18:52:15] option b is better but unless I'm missing something it would require patching both EL and the plugin in puppet [18:52:19] cc ottomata [18:52:40] joal: looks good, i will just edit it a little bit [18:52:47] sure nuria_, feel free [18:52:53] nuria_: we did but I wasn't considering that we only have the parsed UA at that stage [18:53:06] fdans: we lose info on mediawiki by the time it gets to consumer? [18:53:13] because the original ua has been lost? [18:53:31] Done for tonight folks - see you tomorrow ! [18:53:32] yeah, uaparser knows not about this mediawiki business [18:53:33] the parsed UA won't have 'MediaWiki' as a value somewhere? [18:53:43] ottomata: no, it is a bot [18:53:48] ottomata: as it is a custom UA [18:53:49] that's right [18:54:01] ah, so the parsed ua will just have is_bot: true and that's it? [18:54:07] ottomata: ua parser only knows about "official" UAs and major bots like google or bing [18:54:13] and the generic stuff from uaparser [18:54:19] which will have what? [18:54:47] stuff like "Other", and "-" version numbers [18:56:08] this is why we have the regex as a secondary bot filter in the EL parser [18:56:20] at that point we still have the raw UA string [18:58:37] should we augment the ua parser values with stuff we know? [18:58:39] like, mediawiki? [18:58:43] dunno which field [18:58:43] but [18:58:53] browser: MediaWiki? :)
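On the point above that the "lost" events can be retrieved from Hadoop: the per-schema topics are imported by Camus regardless of the mysql filter, so nothing is really gone. A rough PySpark sketch — the HDFS path layout and the capsule's epoch timestamp field are assumptions, not verified values:

```
# Assumption-heavy sketch: pull MediaWikiPingback events dropped from MySQL
# after the 2017-05-24 deploy out of the raw Camus import on HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('recover-pingback').getOrCreate()

raw = spark.read.json(
    '/wmf/data/raw/eventlogging/eventlogging_MediaWikiPingback/hourly/2017/05/*/*'
)
# 1495584000 == 2017-05-24T00:00:00Z, assuming an epoch 'timestamp' field
raw.where(raw['timestamp'] >= 1495584000) \
   .write.parquet('/tmp/mediawikipingback_recovered')
```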
[18:58:53] :) [19:00:53] let's first make sure that tgr needs that user agent [19:01:36] even if he doesn't need it we need to augment uaparser with MW if we want to whitelist it at the comsumer [19:02:04] well, we *could* still just whitelist that one schema [19:02:04] if we had to i guess [19:02:33] in tgr's schema: https://meta.wikimedia.org/wiki/Schema:MediaWikiPingback [19:02:42] ottomata: we could add another filter function for schemas ottomata [19:02:54] the mw version is part of the event thus I am not sure if UA is providing value [19:03:37] fdans: so maybe not classifying mediawiki as bot is going to be the easiest [19:04:08] nuria_: classifying mw would allow us to not have to look at specific schema names [19:04:08] oh yeah that's definitely the easiest [19:04:39] oh, you mean just yeah, saying all Mediawiki uas are not bot [19:04:43] i'm +1 for that i guess? [19:04:57] but even then it would mean that MediaWikiPingback_15781718 will get no info on UA on db tables [19:06:06] fdans: the only other solution i see is to add is_mediawiki to UA and on the filter filter on that [19:06:15] yeah but as you mentioned, the schema carries the mediawiki version? [19:06:20] so if is_bot and is_mediawiki are true we let things pass [19:07:13] ottomata, fdans : so 1) either adding an additional is_mediawiki to schema or 2) removing mediawiki from bot identification, those two are the easiest [19:07:18] I think [19:09:17] * fdans is thinking [19:09:56] if we don't really need is_mediawiki [19:09:59] then i prefer 2 [19:10:50] the is_mediawiki solution seems super specific for [19:11:21] between those two I prefer 1 [19:13:05] but to me the ideal would be to parse correctly mediawiki and set a whitelist in the filter [19:13:45] but I'm for just deleting mediawiki from the regex for now [19:16:44] parse correctly mediawiki? [19:17:58] fdans: not sure what you mean cc ottomata [19:18:38] sorry [19:19:06] I meant augment uaparser and have the json ua map carry info about mediawiki [19:19:08] nuria_: [19:20:13] fdans: ya, not super fond of that, we use ua parser so we can provide a consistent parsing in java/python.. and on my opinion it does it job, that request with a "custom" ua is a bot [19:20:29] fdans: it so happens that we do want data from those bots cause ther aye OUR bots [19:20:35] *they are OUR BOTS [19:20:38] sorry [19:20:40] no caps [19:20:55] yeah that makes sense [19:21:30] so i think the is_bot is correct and regex is correct, thus i favor adding a is_mediawiki [19:21:33] field [19:21:46] but please you can strongly disagree [19:24:27] no it makes sense, my only concern is this solution may not scale if we have any other special cases [19:25:38] if you feel this is a super exceptional case that we should let through, then it makes sense to me to add the is_mediawiki nuria_ ottomata [19:26:02] isn't there a field we can reuse? [19:26:07] instead of adding a new one? [19:26:25] ottomata: this would go inside the ua json map right? [19:28:42] yes [19:29:53] fdans, ottomata : Other special cases are going to have to sent Mediawiki on the UA to get through [19:30:01] fdans, ottomata : taht seems a fair requirements [19:35:53] fdans, ottomata : if that seems acceptable then let's add mediawiki flag [19:36:56] yeah that sounds good to me [19:40:59] fdans: are you looking for a field in the ua json map to re-use? 
[19:41:05] just without adding a special new field [19:42:11] naaah if we're changing the ua map we can just add that prop [19:42:24] guess sooooo [19:42:33] i'm not familiar with the ua fields [19:42:39] was just hoping there would be a nice one to reuse [19:42:42] we already got is_bot [19:42:43] that way if we don't have another case like this [19:42:47] we don't have to keep adding fields [19:42:55] we can just change the value of the one field [19:48:45] elukey: still around? [20:05:18] (03PS3) 10Nuria: Memoize host normalization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/356307 [20:16:51] (03PS4) 10Nuria: Memoize host normalization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/356307 [20:27:37] 06Analytics-Kanban, 13Patch-For-Review: Count global unique devices per top domain (like *.wikipedia.org) - https://phabricator.wikimedia.org/T143928#3305893 (10Nuria)
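For reference, the is_mediawiki flag the team converges on above would look roughly like this: the flag has to be computed in the EL parser, the only stage that still sees the raw UA string, and the consumer-side filter then combines it with is_bot. Every name below is illustrative, and the bot regex only paraphrases the "^MediaWiki.*$"-style pattern quoted earlier in the log:

```
import re
from ua_parser import user_agent_parser  # 0.x API

# paraphrase of EL's secondary bot filter; the real regex lives in
# eventlogging/utils.py and also matches MediaWiki (the whole problem)
BOT_RE = re.compile(r'bot|spider|crawler|MediaWiki', re.IGNORECASE)
MEDIAWIKI_RE = re.compile(r'\bMediaWiki\b')

def parse_ua(raw_ua):
    parsed = user_agent_parser.Parse(raw_ua)
    return {
        'browser_family': parsed['user_agent']['family'],  # 'Other' for MW
        'os_family': parsed['os']['family'],
        'is_bot': bool(BOT_RE.search(raw_ua)),
        # computed here because only the parser still sees the raw string
        'is_mediawiki': bool(MEDIAWIKI_RE.search(raw_ua)),
    }

def should_insert(ua_map):
    # mysql-consumer side: bots stay out unless they are our own MediaWiki
    return (not ua_map['is_bot']) or ua_map['is_mediawiki']

assert should_insert(parse_ua('MediaWiki/1.30.0'))
assert not should_insert(parse_ua('Googlebot/2.1 (+http://www.google.com/bot.html)'))
```

Keeping both flags preserves the split nuria_ argues for at 19:21: is_bot stays honest (a custom UA is still a bot), and the consumer decides which bots it wants to keep.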