[00:30:22] milimetric: I forgot, what did you say was the replacement for limn? Like what powers https://analytics.wikimedia.org/dashboards/? [00:30:59] kaldari: Dashiki [00:31:05] yes, thanks! [00:31:07] docs on wikitech [00:31:15] awesome! [01:35:50] 10Analytics-Kanban, 10Discovery-Analysis, 10Interactive-Sprint: No maps tile requests in webrequests table as of 1 June 2017 - https://phabricator.wikimedia.org/T167083#3317096 (10mpopov) [07:00:09] 10Analytics-Kanban, 10DBA, 10Operations, 10ops-eqiad, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317309 (10elukey) >>! In T166141#3315357, @jcrespo wrote: > Not really, we have almost decided the goals for Q1, and they are all quite urgent and for hardware that ha... [07:10:47] 10Analytics-Kanban, 10DBA, 10Operations, 10ops-eqiad, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317327 (10Marostegui) >>! In T166141#3317309, @elukey wrote: >>>! In T166141#3315357, @jcrespo wrote: >> Not really, we have almost decided the goals for Q1, and they... [08:14:25] 10Analytics-Kanban, 10DBA, 10Operations, 10ops-eqiad, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3317379 (10elukey) Sure I am concerned too, this is why I asked if it was possible to order the hardware as soon as possible to be ready to work on it by the end of Q1 :) [08:29:46] joal: morning! [08:29:56] I might have merged a change without too much caffeine [08:29:57] https://gerrit.wikimedia.org/r/#/c/357315/3/modules/pivot/templates/config.yaml.erb [08:30:29] Daily loading is correct right? 
[09:45:51] * elukey looks for the passzorz for the local mysql db on EL beta [09:48:42] found in https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster [09:52:22] 10Analytics-Tech-community-metrics, 10Developer-Relations (Apr-Jun 2017): Automatically sync mediawiki-identities/wikimedia-affiliations.json DB dump file with the data available on wikimedia.biterg.io - https://phabricator.wikimedia.org/T157898#3317560 (10Albertinisg) >>! In T157898#3316525, @Aklapper wrote:... [10:23:37] INFO: line 95: (DRY-RUN) Executing command: DELETE FROM `Analytics_13317883_15423246` WHERE timestamp >= %(start_ts)s AND timestamp < %(end_ts)s LIMIT %(batch_size)s with params: {'batch_size': 1000, 'end_ts': '20170308102302', 'start_ts': '20170307102302'} [10:23:41] * elukey dances [10:27:21] 10Analytics, 10Analytics-EventLogging, 10DBA: db1047 has been restarted - needs another restart - https://phabricator.wikimedia.org/T166452#3317854 (10Marostegui) 05Open>03Resolved The scope of this ticket is done - pending is the ALTER table to unify revision so we can run pt-table-checksum for enwiki o... [10:51:07] * elukey lunch! [11:12:55] 10Analytics, 10Analytics-Dashiki, 10Wikimedia-log-errors: Warning: JsonConfig: Invalid $wgJsonConfigModels['JsonConfig.Dashiki'] array value, 'class' not found - https://phabricator.wikimedia.org/T166335#3318098 (10hashar) [11:12:57] 10Analytics: Invalid $wgJsonConfigModels['JsonConfig.Dashiki'] array value, 'class' not found - https://phabricator.wikimedia.org/T167054#3318100 (10hashar) [12:13:01] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jul-Sep 2017): Automatically sync mediawiki-identities/wikimedia-affiliations.json DB dump file with the data available on wikimedia.biterg.io - https://phabricator.wikimedia.org/T157898#3318322 (10Aklapper) @albertinisg: Awesome! Thank you a lot! I am... 
[12:17:16] Hey, I am preparing to run an R script on stat1002 that parses the pageview hourly dumps from /mnt/hdfs/wmf/data/archive/pageview/legacy/hourly/. I will be using all 16 cores for some {parallel} regex calls. Q: do I need to notify someone about this, since the script could take a while to complete? [12:23:01] GoranSM: I'd only use nice/ionice to avoid a complete meltdown of the host :) [12:23:09] how long will the script run? [12:30:36] elukey: sincerely, I do not know. I am doing this for the first time. I need to parse the May 2017 and June 2017 pageview hourly dumps and simply count hits for a page. [12:31:43] GoranSM: curiosity - why don't you use the pageview api [12:31:44] ? [12:33:23] elukey: Because I've already written the script. And I am waiting for an approval for analytics-privateuser-data (I think that's the group), so next step HiveQL and no parsing in R. [12:34:21] elukey: I mean: the idea is to do this and similar things on pageviews and wmf.webrequests from Hadoop. But not now. Now I need this data from dumps. [12:36:10] GoranSM: the pageview api offers only daily and monthly views for a specific article though, I mentioned it because it is much quicker than parsing data [12:37:13] mforns: o/ [12:37:19] elukey, hi! [12:37:33] elukey: I don't think that I really need the hourly data in this case. You see, maybe I could use the API after all. [12:38:21] GoranSM: if you don't it will be immensely quicker :) [12:38:59] mforns: thanks for the review! The table parameter will be automatically quoted by pymysql if I put it in the parameters, and it will screw up the query.. long story short, it can't be added :( [12:39:09] mforns: but I ran the script in beta! [12:39:25] elukey, I see! [12:39:37] elukey, you have ze data? [12:39:48] mforns: in deployment-eventlogging03.eqiad.wmflabs:/home/elukey/cleaner.log there is a trace [12:39:50] and you tested it? [12:39:56] ok! 
[12:40:02] we can check together if you want [12:40:06] I used your whitelist [12:40:12] elukey, ok [12:40:31] elukey, BTW, where should the final whitelist go then? [12:40:59] I have to make some modifications to it, but I'd like to put the list in its final location [12:41:36] mforns: we can keep the existing code review, I'll amend it with the puppet code to deploy it [12:42:05] interestingly, in beta the script stopped for [12:42:06] INFO: line 107: Executing command DELETE FROM `toku_test1` WHERE timestamp >= %(start_ts)s [12:42:14] since that table does not have "timestamp" [12:42:21] !!! [12:42:26] but it failed gracefully [12:42:27] so good [12:42:30] cool [12:42:38] everything logged clearly [12:43:08] elukey, I don't have permits to look at your log [12:43:17] :'( [12:44:18] elukey, this might be a small problem, in the EL database, there are also tables that are not event tables I think... [12:44:20] 10Analytics-Kanban, 10Discovery-Analysis, 10Interactive-Sprint: No maps tile requests in webrequests table as of 1 June 2017 - https://phabricator.wikimedia.org/T167083#3318377 (10Gehel) This looks like it might be related to T164608#3248269. It looks like the webrequest_source has changed to "upload" and th... [12:44:20] mforns: now you have it in your home dir :) [12:44:28] thanks! [12:44:40] oh yes definitely, I like that the script stops when this happens [12:45:52] elukey, we could change the select from information schema, to get event tables only, maybe using a regexp? [12:46:22] mforns: there should be only event tables in the log db no? 
[12:46:29] I just checked that the toku db tables are empty [12:46:30] elukey, mmm not sure [12:46:31] I [12:46:32] I [12:46:34] argh [12:46:39] I'll remove them [12:46:39] xD [12:47:48] elukey, yes, EL db only has event tables, cool [12:47:56] checked [12:48:32] MariaDB [log]> select max(timestamp) from CentralAuth_5690875_15423246; [12:48:35] +----------------+ [12:48:38] | max(timestamp) | [12:48:40] +----------------+ [12:48:43] | 20170321155853 | [12:48:45] +----------------+ [12:48:48] \o/ [12:50:26] elukey, awesome! [12:50:36] I couldn't find any update logs though [12:50:53] mforns: yes this is weird.. I am going to check the whitelist [12:50:54] does this make sense? maybe you copied full-purge tables? [12:51:23] well this is beta, I am not sure what tables are in there [12:51:33] maybe there were only the ones not in the whitelist [12:51:55] elukey, makes sense [12:52:27] elukey, you want to pair in da cave and test that scenario? [12:53:52] mforns: lemme check the whitelist against the log to be sure [12:54:00] ok [12:56:40] mforns: as far as I can see the script did the right thing, we have only tables with _somenumber or _somenumber_somenumber [12:57:57] sure [12:58:13] ah wait a minute, in prod it is the same [12:58:24] but we should test the update scenario no? [12:58:52] yes that one too, but now I am trying to understand why the whitelist does not mention any table with _number [12:59:09] was the parsing logic to whitelist the table prefix? 
[12:59:44] if so it is missing from the script :) [13:02:59] aaaah [13:03:34] I used https://gerrit.wikimedia.org/r/#/c/298721/6/files/mariadb/eventlogging_purging_whitelist.tsv as reference [13:03:38] elukey, yes, the first column in the white-list is the schema name without _number or _number_number suffix [13:04:50] the purging can not be tied to a version of the schema, otherwise, when the teams improve their schemas, they would have to add all their fields to the white-list again and again for each new schema version [13:05:26] so, yes, we need some changes... [13:06:07] the _somenumber is the version (revision) number [13:06:52] I believe it should be a minor change [13:07:07] we split the table column by "_" in the parsing fun [13:07:12] keeping only the table prefix [13:07:18] elukey, yes, and change the tests if needed [13:07:24] then we just re.match rather than == [13:07:29] definitely [13:07:32] I [13:07:38] xD [13:07:39] I'll look into it in a bit [13:07:46] last bug sigh [13:07:51] I thought we were ready [13:07:51] ok [13:07:59] good that we have tested it in beta :P [13:08:05] it was my fault [13:08:13] should have remembered [13:12:38] elukey, regarding location of the white-list, in the current patch it is in /files/mariadb/eventlogging_purging_whitelist.tsv [13:13:11] but the script is in /modules/role/files/mariadb/eventlogging_cleaner.py [13:13:28] should I move the white-list to /modules/role/files/mariadb/ ? [13:13:50] and how about renaming the whitelist to eventlogging_cleaner_whitelist.tsv [13:13:51] ? [13:15:06] mforns: ah yes role/etc.. is a better location [13:15:12] and +1 for the name [13:21:04] elukey, k [13:22:24] ahh mforns, this is the only bit to change [13:22:25] for table in tables: [13:22:25] if table not in whitelist: [13:32:58] for table in tables: [13:32:58] if table [re.match(table_prefix, table) for table_prefix in whitelist]: [13:33:02] this one seems to work --^ [13:33:15] err table in [13:33:28] mmm no, if [..] 
without table [13:33:40] if the list is empty is false, otherwise any matches will lead to a true [13:33:43] testing it [13:34:06] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318500 (10Cmjohnson) @elukey new raid controllers for an1033 and 1039 are on-site. please let me know when you want to swap them out [13:34:18] elukey, looking [13:34:32] mmm no it needs a bit more but the idea is there :D [13:36:05] elukey, how about: [13:36:28] line 440: schema_name = table.split('_')[0] [13:36:46] line 441: if schema_name not in whitelist: [13:37:27] that's even better :) [13:37:48] then we would have to pass both the table and the schema_name [13:37:54] or both the schema name and the suffix [13:38:01] to purge/sanitize [13:38:35] mforns: why? [13:38:54] (need to work on some analytics nodes, will lag a bit) [13:39:22] I guess purge/sanitize need the schema name to parse the whitelist, and the full table name to execute the queries [13:39:41] we can pass the full name and repeat the split('_') insite purge/sanitize also... [13:39:48] *inside [13:43:27] mforns: ahhh since we use the whitelist, right [13:43:32] well should be a small change [13:43:32] yes [13:43:34] nothing big [13:43:36] okok [13:43:37] yea [13:43:42] will do it after this work on an hosts [13:43:58] np [13:44:24] I can do it as well when I finish webrequest tagging CR [13:52:00] nono I am almost done [13:56:24] mforns: a simple solution, that doesn't touch too much the code, is to simply calculate schema_prefix = table.split('_')[0] in the sanitize [13:56:33] wdyt? [13:56:58] passing it via argument might be redundant [13:57:06] elukey, yes, but it needs to be calculated in the main as well no? 
otherwise we don't know if the table should be purged or sanitized [13:58:37] (03CR) 10Mforns: UDF to tag requests (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/353287 (https://phabricator.wikimedia.org/T164021) (owner: 10Nuria) [13:59:04] mforns: sure [13:59:39] I also added a sanity check in the sanitize: if the table-prefix is not in the whitelist it just raise a RuntimeError [14:05:51] (03CR) 10Ottomata: UDF to tag requests (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/353287 (https://phabricator.wikimedia.org/T164021) (owner: 10Nuria) [14:07:09] elukey: Thanks for the API suggestion, the thing rocks. [14:07:16] \o/ [14:07:26] (03CR) 10Ottomata: UDF to tag requests (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/353287 (https://phabricator.wikimedia.org/T164021) (owner: 10Nuria) [14:15:43] 10Analytics-Kanban, 10Discovery-Analysis, 10Interactive-Sprint: No maps tile requests in webrequests table as of 1 June 2017 - https://phabricator.wikimedia.org/T167083#3318749 (10Nuria) Correct, webrequest_source="maps" no longer exists. [14:34:27] mforns: another interesting thing that we discussed in the past but never followed up on [14:34:30] /usr/lib/python3/dist-packages/pymysql/cursors.py:158: Warning: Column 'event_revID' cannot be null [14:34:52] elukey, uuuuuu [14:34:54] :/ [14:35:44] yeah.. we need to make sure that the fields are nullable [14:35:45] mh [14:35:55] and if not... [14:35:55] otherwise no bueno [14:35:59] xD [14:36:07] if not alter table :P [14:36:28] or we just keep them [14:37:05] elukey, was that a real example? do we have actual non-nullable fields in EL db? 
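A minimal sketch of the table-prefix check agreed on above, assuming the whitelist is keyed by schema name and that table names look like `Schema_revision` or `Schema_revision_revision`. The function names and the sample whitelist contents are hypothetical; this is not the code in eventlogging_cleaner.py.

```python
def schema_name_for(table):
    """Return the part of the table name before the first '_' (assumed to be the schema name)."""
    return table.split('_')[0]


def split_tables(tables, whitelist):
    """Partition tables into (to_sanitize, to_purge) by looking up the schema prefix."""
    to_sanitize, to_purge = [], []
    for table in tables:
        if schema_name_for(table) in whitelist:
            to_sanitize.append(table)
        else:
            to_purge.append(table)
    return to_sanitize, to_purge


def whitelisted_fields(table, whitelist):
    """Sanity check mentioned in the chat: refuse to sanitize a table whose prefix is unknown."""
    schema = schema_name_for(table)
    if schema not in whitelist:
        raise RuntimeError('schema %s (table %s) is not in the whitelist' % (schema, table))
    return whitelist[schema]


# Hypothetical whitelist: schema name -> fields to keep.
example_whitelist = {'CentralAuth': ['id', 'uuid', 'timestamp']}
example_tables = ['CentralAuth_5690875_15423246', 'toku_test1']
sanitize, purge = split_tables(example_tables, example_whitelist)
```

The full table name is still what goes into the queries; only the whitelist lookup uses the prefix. A schema name that itself contains an underscore would need a different split rule.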
[14:38:15] the error msg above is for MobileWikiAppShareAFact_12588711_15423246 [14:38:41] mmmmm [14:39:17] in beta it is not null [14:39:37] yea, in the schema it is a required field [14:40:13] even on db1047 is non-nullable [14:40:51] (╯°□°)╯︵ ┻━┻ [14:40:57] ahhahahah [14:41:21] 10Analytics, 10Wikipedia-iOS-App-Backlog, 10Reading Epics (Analytics), 10Spike, 10iOS-app-v5.6.0-Goat-On-A-Train: Research and define initial technical requirements for app analytics - https://phabricator.wikimedia.org/T164801#3318870 (10Fjalapeno) [14:41:40] maaan, the only solution I see is non-null garbage value... which is not optimal at all [14:41:44] but... [14:43:48] can we use "batmat" please ? [14:43:51] *batman [14:43:55] ahahhaha [14:44:00] xD [14:44:09] or something like "NaNaNa" in all fields [14:44:18] a select * would become awesome [14:44:27] xD [14:44:32] oh man [14:45:24] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318884 (10Cmjohnson) 05Open>03Resolved Replaced both bbu's Return shipping info Fedex 9612018 6911799 02034386 96112018 6911799 02034379 [14:45:33] we can discuss this after standup no? [14:47:20] mforns: okok [14:47:49] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318897 (10elukey) 05Resolved>03Open [14:56:43] 10Analytics-Kanban, 10Operations, 10ops-eqiad, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3318969 (10elukey) [15:00:29] 10Analytics-EventLogging, 10Analytics-Kanban: whitelist multimedia and upload wizard tables - https://phabricator.wikimedia.org/T166821#3308710 (10mforns) @JKatzWMF I added the schema fields that we discussed to the white-list. You can check it here: https://gerrit.wikimedia.org/r/#/c/298721/ The related chan... [15:04:34] a-team standup! [15:04:53] joal: ? 
[15:05:19] oh nm [15:18:26] 10Analytics-Kanban, 10Discovery-Analysis, 10Interactive-Sprint: No maps tile requests in webrequests table as of 1 June 2017 - https://phabricator.wikimedia.org/T167083#3319067 (10debt) @Nuria - how can we get this fixed? Is the data for the last 6-ish days lost? @MaxSem - is this something you can help us w... [15:24:33] !log restarting eventlogging processor to bring in is_mediawiki ua classification [15:24:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:31:11] o/ ottomata [15:31:16] hiyaa [15:31:19] so yeah, ask away! :) [15:33:16] FYI, the live systems meeting tomorrow is going to have some guests who are looking to use hadoop/hive for some cool stuff. [15:33:29] joal & milimetric: ^ [15:33:32] live systems meeting? [15:33:50] Yeah. I've just renamed it to "Large scale data analysis". [15:34:03] Started as a regular meeting to talk about infra for wikicredit. [15:34:11] It turned into a meeting to talk about live computation in general. [15:34:20] ottomata: interested in what happens when a request is made to foo.wikipedia.org/beacon/event?... [15:34:20] what's live computation mean? [15:34:28] And now I'd like to generalize to "How do I do this big analysis in (batch|realtime)?" [15:34:38] phuedx: ok!, as in, what receives that, and how does it get into the EL system? [15:34:39] in my head it's "varnish -> varnish kafka" [15:34:43] that's correct [15:34:45] ottomata, computation based on streams or streamish things [15:34:47] but which varnishes [15:34:54] halfak: you know i want to come to that [15:34:55] and are there layers of varnish? [15:34:59] invite me please! [15:35:08] and is there any processing before varnishkafka? 
[15:35:15] ottomata, done :) [15:35:28] phuedx: oook so [15:35:37] yeah there are layers of varnishes and different classes [15:35:49] the ones that respond to /beacon are the 'text' varnishes (which are used for most things) [15:35:58] there are a lot of them, and they are in all the different datacenters [15:36:10] there are a couple of layers for caching purposes [15:36:12] but that's not important here [15:36:26] all our logging is done from the frontend layer, since that is the only layer that gets all requests [15:36:42] so, varnish just responds to that request with whatever (201? don't remember) [15:36:47] but doesn't forward it anyway [15:36:59] there is a special varnishkafka instance that looks for requests to /beacon/event [15:37:09] (204) [15:37:24] any request there is produced to kafka in a JSON format that includes the uri_query params [15:37:34] so, there is a topic in kafka that contains all of those requests [15:38:15] brb [15:39:54] then, eventlogging (specifically eventlogging-processor) consumes from that topic, decodes the data in uri_query, [15:40:03] parses that as json [15:40:08] validates the event against the jsonschema [15:40:18] and then produces that validated JSON event to two different kafka topics [15:40:23] one, is called 'eventlogging-valid-mixed' [15:40:29] that contains ALL valid events from all schemas [15:40:38] the other, is eventlogging_ [15:40:43] so each schema gets its own Kafka topic [15:40:52] eventlogging-valid-mixed is really only used to import all events into MySQL [15:41:04] the eventlogging_ topics can be used for anything [15:41:08] but those are the ones that are imported into Hadoop [15:41:16] phuedx: make sense? 
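The processing steps described above (decode uri_query, parse JSON, validate, produce to two topics) can be sketched roughly as follows. This is an illustration only: the uri_query format ('?' plus percent-encoded JSON with an optional trailing ';'), the `eventlogging_` + schema name topic pattern, the required-key check standing in for JSON Schema validation, and the sample schema name are all assumptions, and nothing here talks to Kafka.

```python
import json
from urllib.parse import unquote

MIXED_TOPIC = 'eventlogging-valid-mixed'


def decode_uri_query(uri_query):
    """Decode the assumed format: '?' + percent-encoded JSON + optional ';'."""
    encoded = uri_query.lstrip('?').rstrip(';')
    return json.loads(unquote(encoded))


def is_valid(event, required_fields=('schema', 'revision', 'event')):
    """Stand-in for JSON Schema validation: only checks that the required keys exist."""
    return all(name in event for name in required_fields)


def topics_for(event):
    """Every valid event goes to the mixed topic and to a per-schema topic (assumed naming)."""
    return [MIXED_TOPIC, 'eventlogging_' + event['schema']]


def process(uri_query):
    """Return (event, topics); invalid events get no topics."""
    event = decode_uri_query(uri_query)
    if not is_valid(event):
        return event, []
    return event, topics_for(event)


# 'Popups' is a hypothetical schema name used only for this example.
sample = '?%7B%22schema%22%3A%22Popups%22%2C%22revision%22%3A1%2C%22event%22%3A%7B%7D%7D;'
event, topics = process(sample)
```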
[15:42:57] ottomata: yarrrp, that makes sense, thanks [15:44:13] !log restarting eventlogging mysql consumer to allow is_mediawiki events through is_not_bot filter [15:44:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:44:35] phuedx: are you working on the page create event thing? [15:44:55] oh no, that's the comtech team [15:44:59] yeah, whatcha working on? [15:45:08] ottomata: page previews [15:45:37] what's that? [15:45:44] hovercards, popups [15:45:51] three different names for the same project ;) [15:45:55] ah ya, you trying to emit events for that? [15:47:03] ottomata: i'm trying to generate hypotheses for why we're seeing high levels of duplication in the events that we're logging [15:47:10] i want to believe that it's the client's fault (i.e. in the code that i wrote) [15:47:32] ahh [15:47:42] but it occurred to me (and tilman when i've been talking with him) that we don't really know much about the bit in between varnish and mysql/hadoop [15:47:48] phuedx: you can consume from your kafka topic directly [15:48:01] to confirm if the duplicate events are actually seen by varnish [15:48:09] if they are in the eventlogging_ topic, they were [15:48:29] you could also check in hadoop to see if they are duplicated there too [15:48:40] if they are there, as well as in mysql, you could be pretty sure they were seen by varnish [15:48:50] if just in mysql, then it is probably a problem with the mysql insertion code [15:48:54] in eventlogging [15:49:36] 10Analytics-Kanban, 10Discovery-Analysis, 10Interactive-Sprint: No maps tile requests in webrequests table as of 1 June 2017 - https://phabricator.wikimedia.org/T167083#3319207 (10Gehel) My guess is that @mpopov probably knows enough to fix that request (now that he knows what has changed). If it is not the... 
[15:51:18] ottomata: we're seeing them in both mysql and hadoop, so we're confident that duplicates were added to kafka [15:51:24] again, just wasn't sure about the machinery in the middle [15:51:29] but it seems super-simple [15:52:51] the duplication presents itself very… oddly [15:52:56] it is possible for varnishkafka to produce duplicates [15:53:00] but, not often [15:53:14] and, it wouldn't just happen to your schema [15:53:18] it'd happen to all of them when it happens [15:53:20] firefox is the main offender [15:53:33] and by main, i mean like 2000 events a day [15:53:39] vs. 1 by chrome or safari [15:53:59] phuedx: if you want to super duper confirm, the el requests are also logged to the webrequest table [15:54:11] so, you could find your events in there as normal web access requests [15:54:18] that table has two extra fields [15:54:24] hostname and sequence [15:54:37] where hostname is the actual varnish server hostname that served the request [15:54:44] and sequence is the incrementing unique sequence id for that varnish instance [15:54:55] if you find your duplicate request and it has different hostname/sequence [15:55:03] then you know it was actually two different requests [15:55:07] to varnish [15:55:28] if they are the same, then it is varnishkafka's fault, since it produced the event more than once [15:55:40] ottomata: thanks for the tip! [15:55:43] we actually keep some stats on these types of events too [15:56:01] in the wmf_raw.webrequest_sequence_stats and wmf_raw.webrequest_sequence_stats_hourly tables [15:56:04] in hive [15:56:22] 10Analytics, 10Editing-Analysis: Bring the Editor Engagement Dashboard back - https://phabricator.wikimedia.org/T166877#3319222 (10Halibutt) Thanks @Nuria , especially the compare link is helpful. Does the second link show VE edits only, or is it both editors combined? Also, are there any time stats as well?... 
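The hostname/sequence reasoning above can be sketched like this, assuming the matching webrequest rows have already been fetched and that each event carries some unique id. The `event_id` key and the row layout are hypothetical; only `hostname` and `sequence` come from the discussion.

```python
from collections import defaultdict


def classify_duplicates(rows):
    """Group rows by event id and decide where each duplicate most likely came from.

    rows: iterable of dicts with 'event_id', 'hostname' and 'sequence' keys.
    Returns {event_id: 'client' | 'pipeline'} for ids seen more than once.
    """
    seen = defaultdict(list)
    for row in rows:
        seen[row['event_id']].append((row['hostname'], row['sequence']))

    verdicts = {}
    for event_id, keys in seen.items():
        if len(keys) < 2:
            continue  # not duplicated
        if len(set(keys)) < len(keys):
            # The same hostname/sequence pair appears more than once:
            # one request was produced several times downstream of varnish.
            verdicts[event_id] = 'pipeline'
        else:
            # Every copy has its own hostname/sequence: varnish really
            # received separate requests, so the client sent them.
            verdicts[event_id] = 'client'
    return verdicts


example_rows = [
    {'event_id': 'a', 'hostname': 'cp1', 'sequence': 10},
    {'event_id': 'a', 'hostname': 'cp2', 'sequence': 77},
    {'event_id': 'b', 'hostname': 'cp1', 'sequence': 11},
    {'event_id': 'b', 'hostname': 'cp1', 'sequence': 11},
    {'event_id': 'c', 'hostname': 'cp3', 'sequence': 5},
]
verdicts = classify_duplicates(example_rows)
```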
[15:56:43] for any given hour, you can see the number of missing or duplicate events [15:57:03] where duplicate here means same hostname/sequence [15:58:54] woah ottomata, that's awesome [15:59:12] we basically have a list of hypotheses [15:59:17] 10Analytics, 10Pageviews-API: Endpoint for average view rate in Pageview API - https://phabricator.wikimedia.org/T162933#3319234 (10Halfak) @Nettrom, I think this is our biggest blocker for getting the Item Quality model hosted in ORES. [15:59:19] and are testing 'em [15:59:45] that'd knock out the varnish->varnishkafka hypothesis quickly [16:09:49] so ottomata, if i had a unique id for an event, i could check if that event has been duplicated by querying those tables? [16:34:00] Hey halfak - Sorry I won't be there tomorrow, I'll be off as I was today [16:34:00] 10Analytics-EventLogging, 10Analytics-Kanban: whitelist multimedia and upload wizard tables - https://phabricator.wikimedia.org/T166821#3319339 (10JKatzWMF) Thanks, @mforns! [16:34:32] joal, gotcha. Thanks for letting me know :D [16:34:37] * joal is sad to miss the guests [16:34:53] Guests will be sad to miss joal [16:37:39] 10Analytics, 10Analytics-EventLogging, 10Editing-Analysis, 10Wikimedia-Hackathon-2017, and 2 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3319348 (10dbarratt) It looks like we are already logging an event when [[ https:/... [16:39:52] 10Analytics-Kanban, 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3319369 (10elukey) Current status: ``` elukey@neodymium:~$ sudo cumin 'R:class = role::analytics_cluster::hadoop::worker' 'megacl... [16:46:12] hey ottomata, sorry for missing, was off today and so tomorrow [16:58:15] np sorry, didn't realize! [16:58:17] enjoy your time off [16:58:21] joal: tomorrow too? 
[16:58:27] so you aren't doing this researcher interview with me? [16:58:33] ottomata: yessir, but will be there for interview [16:58:36] ok cool [16:58:38] ;) [17:01:12] 10Analytics, 10Analytics-EventLogging, 10Editing-Analysis, 10Wikimedia-Hackathon-2017, and 2 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3319449 (10Ottomata) Please note that the same fields you all want have been discu... [17:10:55] ottomata: re the query i asked about earlier, i just checked the _hourly table for an hour that i know has duplicate events in it and it states that there's no duplicates so 👍 [17:12:58] oook great :) [17:13:07] gotta run for a bit to pick up my car from garage, bbib [17:13:21] o/ ottomata -- just saw https://phabricator.wikimedia.org/T150369#3319449 [17:13:34] maybe page-create should be its own event? Is it already? [17:14:04] I don't see something like that in https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki/page [17:16:20] 10Analytics, 10Analytics-EventLogging, 10Editing-Analysis, 10Wikimedia-Hackathon-2017, and 2 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3319489 (10kaldari) @Ottomata: Are you arguing against including this data in Even... [17:18:00] o/ kaldari [17:18:12] hello :) [17:18:14] Looks like we're both interested in talking to ottomata about this. [17:18:24] But he quit right before I pinged him :/ [17:18:29] So let's scheme without him :D [17:18:30] boo [17:18:45] Your Q re event logging is interesting to me too. [17:18:53] I think there should be a page-create event *somewhere* [17:18:59] WIth the metadata we want [17:19:17] my plan is to just add a few more fields to https://meta.wikimedia.org/wiki/Schema:PageCreation for now, which should be simple enough. 
I would love to have this data in EventBus and everywhere else too :) [17:19:29] what are the pieces of data that you need? [17:19:32] * halfak clicks [17:19:45] Aww. Lookie that! I made this thing. [17:20:03] The ones I want added are: [17:20:06] Edit count of the page creator [17:20:06] Age of the page creator account in days [17:20:06] Does the page creator have the autopatrol right? [17:20:06] Whether or not the page is a redirect [17:20:06] Initial size of the page [17:20:14] Probably while I was digging through page BS doing this study: https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation [17:20:36] kaldari, we should have something for rights too [17:20:48] Could have an editor who is manual-confirmed or a bot [17:21:26] maybe an array of all the user's rights? [17:21:50] or all of their user groups? [17:21:52] or both? [17:22:38] Hmm... rights/groups seems like a good candidate for a slow-moving record [17:22:47] milimetric had some jargon for things like this. [17:23:05] would this qualify as a slow-moving record? [17:23:28] https://en.wikipedia.org/wiki/Slowly_changing_dimension [17:23:56] Yeah. This thing. I know that many of the analytics folks have been thinking about the best way to track these kinds of things. [17:24:12] That was a while ago so maybe there's something that would work for it now? [17:24:26] I suppose edit count at time of creation isn't slow changing [17:24:38] So we'll want a field for that too [17:24:53] yeah, definitely edit count and account age [17:25:13] kaldari, account age should be a simple join [17:25:18] edit count is computationally complex [17:28:31] well, it's not quite just a join, but definitely easier than edit count to retrieve. [17:28:43] join + date math? [17:28:50] date math is O(1) [17:28:50] yeah [17:29:14] what's a join? [17:29:29] also O(1)? [17:30:35] Sorry yeah. big-O notation. O(1) is constant time (doesn't count). A join is O(log n). 
Combining page creations with the `user` table is a "join" [17:30:37] do we have space constraints on stat1002? [17:30:54] We do have space constraints in the mysql database [17:30:58] technically not on stat2 [17:31:18] oh yeah, that's just the gateway, huh? [17:31:27] which database server does it live on? [17:32:09] 10Analytics-Kanban: Modify EventLogging so that all table fields are nullable - https://phabricator.wikimedia.org/T167161#3319587 (10mforns) [17:33:03] whatever analytics-store.eqiad.wmnet points to. I know that the event logging DB ("log") has had IO and space issues in the past [17:33:16] halfak: What do you think of "Whether or not the page is a redirect" and "Initial size of the page"? [17:33:43] +1 [17:33:49] that's good to know, I guess we shouldn't spam it with denormalized data then :P [17:34:17] so that just leaves the user rights/groups to figure out [17:34:29] If we can route this directly to HDFS, then I'd side with lots of denormalization [17:34:37] +1 [17:37:18] 10Analytics-Kanban: Make non-nullable columns in EL database nullable - https://phabricator.wikimedia.org/T167162#3319610 (10mforns) [17:39:37] halfak: Personally, I favor identifying the relevant user rights and setting booleans for them in the schema (to keep things simple). Most rights are not that relevant for page creation events. [17:40:18] I would say autoconfirmed, autopatrolled, bot... [17:40:54] kaldari, won't be future proof. [17:41:02] nothing is futureproof :) [17:41:19] Honestly having a field where all user_groups are concatenated would work for most queries. [17:41:40] *futurerobust [17:41:42] but it would be slow [17:42:22] we could also do both [17:43:04] maybe booleans for the 3 user rights listed above, and then a field to store all the user groups [17:43:52] booleans are cheap [17:44:34] that would also help us in the case that the user rights for the groups are changed [17:49:28] halfak: gotta run to a meeting a few minutes. 
Any thoughts on the proposal above? hopefully more futureproof :) [17:49:47] kaldari, checking a boolean and running a string match are both pretty damn fast [17:50:19] Make an edit to the Schema page with what you think is right (even if you disagree with me) and let's go from there. [17:50:50] I'll push on ottomata re. a page-creation event and the data lake the next time I see him (might be early tomorrow AM if he's off for the day) [17:50:57] cheers! [17:52:46] sounds good to me! [18:09:41] 10Analytics, 10Analytics-EventLogging, 10Editing-Analysis, 10Wikimedia-Hackathon-2017, and 2 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3319764 (10kaldari) @dbarratt: After discussing with Halfak, here are the new fiel... [18:16:50] * elukey off! [18:21:44] (back) [18:25:06] 10Analytics, 10EventBus, 10ORES, 10Scoring-platform-team-Backlog: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#3319963 (10Ottomata) [18:25:33] 10Analytics, 10EventBus, 10ORES, 10Scoring-platform-team-Backlog: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#3319973 (10Ottomata) [18:25:53] 10Analytics, 10EventBus, 10ORES, 10Scoring-platform-team-Backlog: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#3319947 (10Ottomata) [18:33:52] 10Analytics, 10Analytics-EventLogging, 10Editing-Analysis, 10Wikimedia-Hackathon-2017, and 2 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3319988 (10Ottomata) > Are you arguing against including this data in EventLogging... 
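One possible shape for the extra page-creation fields discussed above: booleans for the three rights named in the chat, plus a list of all user groups. This is only a sketch; every field name, the size unit, and the helper function are hypothetical and not the actual Schema:PageCreation definition.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import List


@dataclass
class PageCreationExtras:
    # Fields requested in the discussion (names are illustrative only).
    user_edit_count: int
    user_age_days: int
    is_redirect: bool
    initial_size: int  # unit is an assumption (e.g. bytes)
    # Booleans for the rights considered most relevant to page creation.
    user_is_autoconfirmed: bool = False
    user_is_autopatrolled: bool = False
    user_is_bot: bool = False
    # Full list of groups, kept in case the rights attached to a group change.
    user_groups: List[str] = field(default_factory=list)


def account_age_days(registered: date, created: date) -> int:
    """The 'date math' step: whole days between registration and page creation."""
    return (created - registered).days


example = PageCreationExtras(
    user_edit_count=42,
    user_age_days=account_age_days(date(2017, 1, 1), date(2017, 6, 6)),
    is_redirect=False,
    initial_size=1024,
    user_is_autoconfirmed=True,
    user_groups=['autoconfirmed'],
)
payload = asdict(example)
```

Keeping both the booleans and the group list means simple filters stay cheap while the raw groups remain available for later reinterpretation.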
[19:22:30] Analytics-Kanban, Discovery-Analysis, Interactive-Sprint, Patch-For-Review: No maps tile requests in webrequests table as of 1 June 2017 - https://phabricator.wikimedia.org/T167083#3320268 (mpopov) Open>Resolved a:mpopov Thanks for finding the source of the problem, @Gehel! @debt: th...
[19:34:17] Analytics-Kanban, Discovery-Analysis, Interactive-Sprint, Patch-For-Review: No maps tile requests in webrequests table as of 1 June 2017 - https://phabricator.wikimedia.org/T167083#3320360 (debt) Woohoo! Thanks, @mpopov and @Gehel!
[19:40:15] Analytics, EventBus, ORES, Scoring-platform-team-Backlog: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#3320370 (Halfak) https://gist.github.com/halfak/10183548a4d754935481b9bddf9544e8
[19:41:08] Analytics, EventBus, ORES, Scoring-platform-team: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#3320373 (Halfak)
[20:30:59] Analytics, Analytics-EventLogging, Editing-Analysis, Wikimedia-Hackathon-2017, and 2 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3320601 (kaldari) @Ottomata: That sounds like a great idea. One impediment I wan...
[20:31:16] Analytics: Requesting access to the nda LDAP group for GoranSMilovanovic - https://phabricator.wikimedia.org/T167199#3320602 (GoranSMilovanovic)
[20:34:59] Analytics, Pageviews-API, ProofreadPage: API: image thumb-url for ProofreadPages - https://phabricator.wikimedia.org/T167200#3320626 (Mpaa)
[20:41:05] Analytics, LDAP-Access-Requests: Requesting access to the nda LDAP group for GoranSMilovanovic - https://phabricator.wikimedia.org/T167199#3320760 (Peachey88)
[21:19:36] Analytics, Community-Tech: Investigation: How can we improve the speed of the popular pages bot - https://phabricator.wikimedia.org/T164178#3321123 (DannyH)
[21:28:21] Analytics, Analytics-EventLogging, Editing-Analysis, Wikimedia-Hackathon-2017, and 2 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3321176 (Ottomata) Innntersting, ok, I will have a go at docs, I think you are r...
[21:56:37] Analytics, Analytics-EventLogging, Editing-Analysis, Wikimedia-Hackathon-2017, and 2 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3321293 (kaldari) @Ottomata: Thanks! I think the main things that are missing ar...
[23:53:13] Analytics, LDAP-Access-Requests: Requesting access to the nda LDAP group for GoranSMilovanovic - https://phabricator.wikimedia.org/T167199#3321742 (Nuria) @GoranSMilovanovic I think this ticket is a duplicate of your other request for access?
[23:56:00] Analytics, Analytics-EventLogging, Editing-Analysis, Wikimedia-Hackathon-2017, and 2 others: Record an EventLogging event every time a new content namespace page is created - https://phabricator.wikimedia.org/T150369#3321762 (Nuria) @ottomata: let's work on updating docs, could we get an initial...
[23:59:23] Analytics, Pageviews-API: Endpoint for average view rate in Pageview API - https://phabricator.wikimedia.org/T162933#3321772 (Nuria) >I'm a WikiProject maintainer and I want to sort my worklists by the articles view rate, I cannot really do this well through pageview API as i need to request pageviews fo...