[00:15:49] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1879645 (BBlack) The JSON payload size isn't what's important here, just the 2K URL limit.... [00:17:58] !log deployed new version of CirrusSearchRequestSet schema (111448028943) to mediawiki [00:20:22] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1879655 (Ottomata) A sane solution would be to accept events via POST instead of encoded q... [00:21:29] * ebernhardson wonders if !log worked [00:23:47] ottomata: it seems i'm not allowed to look at the camus-mediawiki logs to see if everything works out ok (via yarn logs). Can you check after the next run? [00:24:00] or i suppose i can just check the record count [00:25:20] ebernhardson: how were you checking? it may need sudo -u hdfs if it's being run by the hdfs user [00:25:37] logs are here - https://tools.wmflabs.org/sal/analytics [00:26:32] madhuvishy: well i'm not entirely sure how to check, so i thought i would start by looking at the log output :) [00:26:53] it's just deployed, and it looks like camus-mediawiki ran just before the deployment, so i think i need to wait 15 minutes anyways [00:27:08] ebernhardson: aah, is it being run through oozie? [00:27:28] madhuvishy: tbh not entirely sure, but that's the impression i was under [00:28:05] ebernhardson: what does it do? sorry, just trying to understand what camus-mediawiki is and how it [00:28:14] is run so i can point you to the right logs [00:28:45] madhuvishy: oh sorry, camus mediawiki reads the mediawiki_CirrusSearchRequestSet channel from kafka and sends it to hdfs, afaik [00:28:59] s/channel/topic/ [00:29:26] ebernhardson: this? https://hue.wikimedia.org/oozie/list_oozie_coordinator/0113952-150922143436497-oozie-oozi-C/ [00:30:13] madhuvishy: i think that one just creates the partition, sec, checking [00:30:51] madhuvishy: yea that one just adds the partition, a different one runs camus and brings the data in [00:31:06] madhuvishy: camus is not run via oozie [00:31:09] (not working right now!) [00:32:05] (CR) Nuria: "Thanks for grabbing the changeset. I think the whole patch needs to be updated for javadoc to match the convention of the other UDFs." (7 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [00:33:47] madhuvishy: camus logs are on analytics1027 in /var/log/camus [00:33:53] byeyeyeye (am with people) [00:34:06] ahh thanks, i was digging through puppet and would have found that eventually :) [00:34:18] ebernhardson: ^ can you read those logs?
:) if not i can check for you [00:34:34] madhuvishy: nope, not allowed into 1027 [00:35:22] actually it looks like camus mediawiki only runs once an hour [00:35:34] so have to wait until 5:15 sf time for it to trigger again [00:36:48] although perhaps the last one did fail, since the add partition job didn't find an _IMPORTED tag [00:38:25] ebernhardson: i can get you the 4.15 log [00:38:45] madhuvishy: that will help, thanks [00:42:16] ebernhardson: this is where it's at [00:42:20] https://www.irccloud.com/pastebin/oSa2RFUR/ [00:45:27] hmm, everything there seems happy but no _IMPORTED tag :S checking some other things [00:58:08] madhuvishy: it looks like the camus-checker job was supposed to add the _IMPORTED flag, but the only output in that log file about it is about finding multiple SLF4J bindings. would those errors indicate that it didn't run at all? seems odd though because the deployment hasn't changed [00:59:15] ebernhardson: i'm a bit confused too - i don't think the slf4j warnings mean anything - but it doesn't say the camus-checker job succeeded [00:59:57] any chance it got stuck and is still running in a loop or something? [01:00:05] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1879772 (Tgr) @BBlack, yes, which is why I am saying something should be done to improve c... [01:00:11] the cmdline refers to org.wikimedia.analytics.refinery.job.CamusPartitionChecker [01:01:26] * madhuvishy looks [01:01:48] and yea i read the slf4j logs, that's just a warning [01:01:52] s/logs/docs/ [01:05:42] ebernhardson: this is the wrapper script that's launching the camus jobs [01:05:42] https://github.com/wikimedia/analytics-refinery/blob/7c8f5df7b40e62abbdf289e8ea738fc277744ab9/bin/camus [01:06:02] you can see it takes --check as an arg, and if set launches the CamusPartitionChecker [01:07:26] that all seems to go fine [01:07:56] yea it seems like it, wonder what's not working :S will see what it does when the import for camus runs again in ~10 minutes i guess [01:08:18] yup, i'm tailing the log [01:19:14] hmm, looks like the job ran but nothing output. I think camus is having trouble picking up the new schema :S [01:19:38] must have been a couple records at the end of the previous run that tripped it up as well [01:26:20] i'll revert mediawiki back to sending the old style events and make a ticket for david/nuria to look into it since they are most familiar with the camus end of the code [01:34:32] !log reverted cirrus avro schema in mediawiki back to 101446746400 due to camus not reading in the new events [02:29:17] Analytics, CirrusSearch, Discovery: Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version - https://phabricator.wikimedia.org/T121483#1879953 (EBernhardson) NEW [02:37:28] Analytics, CirrusSearch, Discovery: Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version - https://phabricator.wikimedia.org/T121483#1879964 (EBernhardson) I extracted a couple messages from kafka into my homedir on stat1002, /home/ebernhardson/csrs.-500... [02:45:11] madhuvishy: sorry I got pulled away in real life. If you get just one target working with the dashiki deployer I can easily add the other ones. But I'd change the format you have to not include the gulp command.
Like I'd just have { layout: ..., config: ..., piwikHost: ..., piwikId: ..., hostname: ..., subfolder: ... } [02:45:31] that way we can change the gulp command or even build system and only change one place in fab [02:57:48] milimetric: ya okay :) that was my plan too [04:36:29] Analytics, CirrusSearch, Discovery: Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version - https://phabricator.wikimedia.org/T121483#1880013 (EBernhardson) [04:38:26] Analytics, CirrusSearch, Discovery: Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version - https://phabricator.wikimedia.org/T121483#1879953 (EBernhardson) Reverting the schema in mediawiki doesn't seem to have helped anything move along, the pipeline se... [08:28:30] stat1002 seems to be dead :/ [08:34:50] CPR! [08:35:41] join [09:35:45] hey dcausse, works for me [09:37:00] joal: moritzm remounted the hdfs mount [09:38:10] joal: do you have 5 min for questions concerning camus? :) [10:00:04] jynus, hi [10:00:07] hi [10:00:23] how are you? :] [10:00:31] very well [10:00:33] let me commit a couple of pending patches now [10:00:44] and we can start the downtime [10:00:54] jynus, ok [10:00:57] if you have to stop something, you can do it now [10:01:13] cool, I'll stop the mysql consumers [10:01:22] the patches may impact the proper working of that [10:01:28] ok [10:02:10] Analytics-Tech-community-metrics: gerrit_review_queue can have incorrect data about patchsets "waiting for review" - https://phabricator.wikimedia.org/T121495#1880244 (Aklapper) NEW [10:06:15] see: https://gerrit.wikimedia.org/r/#/c/259222/ [10:07:36] aha [10:07:53] and https://gerrit.wikimedia.org/r/#/c/240043/ [10:08:02] (CR) Joal: "I agree with Nuria's inline comments." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [10:10:01] hey dcausse, sorry, yes, I have time (i was reading emails) [10:10:11] Analytics, CirrusSearch, Discovery: Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version - https://phabricator.wikimedia.org/T121483#1880258 (dcausse) I have no clue :/ I found camus logs sometimes useless, moreover if an error occurs in the mappers (I... [10:10:34] joal: we have troubles with camus since we deployed a schema update :/ [10:11:02] Erik ran some debug tools to make sure that everything in kafka complies with the format we expect [10:11:23] camus logs seem to be fine [10:11:34] jynus, I'm trying to disable the mysql consumers, but it's taking longer than I expected [10:11:51] why? [10:12:11] as in, what is it blocked on? [10:12:14] dcausse: looking at the camus log right now: task error: Error: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: java.util.HashMap cannot be cast to java.util.Collection at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:296) [10:12:36] hu? [10:12:39] dcausse: maaaaaaaaany of them [10:12:44] jynus, well I never did that before, I thought the eventloggingctl script would permit that, but no [10:12:59] I'll have to change the configs and restart EL [10:13:06] it may take 10 mins [10:13:14] 10 is ok, it can wait [10:13:21] joal: do you have one full stack? [10:14:03] sure I do [10:14:39] dcausse: https://gist.github.com/jobar/c6eb4cc4a1eef31171a8 [10:14:53] joal: thanks [10:14:57] np [10:16:50] dcausse: does that make sense to you? [10:19:46] joal, yt?
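For context on the _IMPORTED flag discussed above: the checker runs after Camus and drops an empty flag file once an hourly partition looks complete, which is what the oozie add-partition job waits for. A rough Java sketch using Hadoop's FileSystem API — the path and the completeness test here are simplified assumptions, not the actual CamusPartitionChecker logic, which compares Camus offset files against topic partitions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

// Hypothetical sketch: after a Camus run, verify an hourly partition is
// complete and write an _IMPORTED marker so downstream oozie jobs can start.
public class PartitionCheckerSketch {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical partition directory for one topic/hour.
        Path partition = new Path(
            "/wmf/data/raw/mediawiki/mediawiki_CirrusSearchRequestSet/hourly/2015/12/15/00");

        // The real checker compares Camus offset files against Kafka partitions;
        // here we just require that the directory exists and is non-empty.
        boolean complete = fs.exists(partition)
            && fs.listStatus(partition).length > 0;

        if (complete) {
            // An empty flag file: its presence, not its content, is the signal.
            fs.create(new Path(partition, "_IMPORTED")).close();
            System.out.println("flagged " + partition);
        } else {
            System.out.println("partition incomplete, no flag written");
        }
    }
}
```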
[10:20:28] yup [10:20:32] joal: not yet... [10:20:34] :] [10:24:13] mforns ? [10:24:41] Analytics, CirrusSearch, Discovery: Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version - https://phabricator.wikimedia.org/T121483#1880329 (dcausse) Joal extracted some logs: ``` task error: Error: org.apache.avro.file.DataFileWriter$AppendWriteExcept... [10:25:30] Quarry, Labs: 502 Bad Gateway on HTTP-requests to quarry.wmflabs.org - https://phabricator.wikimedia.org/T121502#1880330 (Stigmj) NEW [10:25:39] jynus, still working :/ [10:32:49] mforns, this is blocking all operations deployments, if it is going to take more time, I would cancel the maintenance and schedule it for another time [10:33:04] jynus, understood [10:33:10] or, I can just restart mysql [10:33:21] and kill the connections [10:34:24] jynus, I will stop the whole system, we will lose the events coming from the server side, but it may be ok [10:35:03] jynus, done [10:35:16] ok, applying the changes and restarting [10:35:21] thx [10:37:46] !log Stopped EventLogging server for database maintenance [10:40:03] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/server-side-events-log consumer/mysql-m4-master-03 consumer/mysql-m4-master-02 consumer/mysql-m4-master-01 consumer/mysql-m4-master-00 consumer/client-side-events-log consumer/all-events-log processor/server-side-0 processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/client-side- [10:40:09] if everything goes well, upgrading, reconfiguring and rebooting will take 10 minutes [10:41:27] jynus, cool [10:41:40] jynus, after that I can restart EL? [10:42:12] yes, but wait for now [10:42:19] sure [10:43:11] joal: I think it's when the schema changes in the kafka stream; the AvroWriter keeps track of the first schema, but the DatumWriter is never updated and keeps the old schema [10:43:33] I think I've added a property that would allow us to force it to use the latest schema [10:46:33] Analytics, CirrusSearch, Discovery: Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version - https://phabricator.wikimedia.org/T121483#1880410 (dcausse) When camus generates a new file it will generate an avro DatumWriter with the first message schema, but... [10:47:05] Phase 5/6: Checking and upgrading tables [10:47:18] ok [10:51:41] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [30.0] [10:53:06] mforns, you can activate it again [10:53:32] jynus, cool [10:54:59] jynus, done [10:56:13] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are running. [10:57:47] we haven't finished yet, we have to check that everything is all right [10:57:58] jynus, sure [10:58:32] I'm following the logs and metrics on the EL server side [10:59:26] doing (something similar) at the db side [11:00:05] jynus, I see insertions happening [11:00:26] great, it will take me a bit longer to check that [11:00:50] as there is some lag towards the analytics slave [11:05:18] my checks are getting stuck on the table after log.GuidedTourButtonClick_13869649 [11:05:45] jynus, I don't understand [11:07:44] I think there is some issue with GuidedTourButtonClick_8221559 [11:08:22] mm [11:08:33] are you writing there?
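A minimal reproduction of the failure joal pasted, assuming only avro on the classpath. The two toy schemas are invented (the real CirrusSearchRequestSet schema is much larger), but the shape matches dcausse's diagnosis: the DataFileWriter stays bound to the first schema it saw, and a record built against a later revision that turned an array field into a map dies with exactly this HashMap-to-Collection cast:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class SchemaDriftRepro {
    public static void main(String[] args) throws Exception {
        // rev 1 of the toy schema: "params" is an array of strings
        Schema rev1 = SchemaBuilder.record("CirrusSearchRequestSet").fields()
            .name("params").type().array().items().stringType().noDefault()
            .endRecord();
        // rev 2: "params" became a map of strings
        Schema rev2 = SchemaBuilder.record("CirrusSearchRequestSet").fields()
            .name("params").type().map().values().stringType().noDefault()
            .endRecord();

        // Camus opens the output file once, with the schema of the FIRST message...
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>());
        writer.create(rev1, new File("out.avro"));

        // ...but a later message in the same run was built against rev 2.
        GenericRecord newer = new GenericData.Record(rev2);
        Map<String, String> params = new HashMap<>();
        params.put("query", "foo");
        newer.put("params", params);

        // Throws DataFileWriter$AppendWriteException caused by
        // java.lang.ClassCastException: java.util.HashMap cannot be cast to
        // java.util.Collection -- the writer still encodes with rev 1.
        writer.append(newer);
        writer.close();
    }
}
```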
[11:08:58] let me grep the logs [11:10:00] jynus, yes, events are being inserted for GuidedTourButtonClick [11:10:02] let me disable the event scheduler, it will have restarted again [11:10:18] ok [11:11:04] ok, so things will go very slow for now [11:11:28] I've seen in the logs that insertions are slow, aha [11:11:41] but that is because the buffers are cold [11:11:47] aha [11:11:57] it may take a few hours to get up to speed [11:12:23] I propose we end the maintenance now, and evaluate in a few hours' time (with you or with someone else) [11:12:48] I think that's OK, Kafka will buffer the events and EL's mysql writers adapt to the database speed [11:13:02] if you agree, can you send an email indicating the end of maintenance to analytics? [11:13:11] I will update the tickets [11:13:12] jynus, sure [11:14:07] I'll be looking at this for a couple hours now [11:15:45] cold buffers?! [11:15:49] * milimetric puts a coat on the buffers [11:16:00] actually, I did that [11:16:09] :) [11:16:20] if it has to be restarted again, the buffer will now be warmed automatically [11:16:26] but I needed a restart for that [11:16:29] :-) [11:17:23] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves - https://phabricator.wikimedia.org/T121120#1880437 (jcrespo) Maintenance seems to have been done correctly, I will keep this ticket open for a few hours to check some minor issues, mostly to check that things are going correctly. But for... [11:18:30] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1880441 (jcrespo) MySQL at db1046 has been upgraded and reconfigured: T121120. [11:19:22] dcausse: sorry, had to help a bit on eventlogging :) [11:20:02] dcausse: to my understanding, avro needs the schema version that has written the event in order to decode it (even if it is to be used with an older version) [11:20:08] is that right? [11:20:31] mforns, it seems to work now, it was just that it was blocked by the writes [11:25:46] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0] [11:28:31] jynus, yes, I also see faster insertions [11:28:53] I think the system already caught up with the buffered events [11:29:53] that is faster than I expected [11:30:15] I'll check [11:30:26] I mean in a good way [11:30:35] Analytics-Backlog: Guidance on using the JSON output of T44259 - https://phabricator.wikimedia.org/T121314#1880460 (Mrjohncummings) Ok, but the problem still stands of not being able to read the end result [11:30:36] hehe [11:30:51] it is part of normal behaviour [11:31:11] yes, it seems the last events inserted date from the last minute [11:38:27] I see on eventlog2001 CRITICAL: Stopped EventLogging jobs: consumer/server-side-events-log consumer/mysql-m4-master consumer/client-side-events-log consumer/all-events-log processor/server-side-0 processor/client-side-0 forwarder/server-side-raw forwarder/legacy-zmq [12:00:39] jynus, this alarm is old though [12:00:49] no? [12:01:58] yes, sorry 28d 9h 57m 53s [12:02:22] np, we should probably remove that [12:03:11] the analytics slave is also inserting new data now [12:03:25] cool [12:04:02] there is only one table giving problems, Echo_7572295 [12:04:13] mmm [12:04:13] which I suppose is precisely the one I want to convert [12:04:20] what are the problems?
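Before the conversation moves on, a brief aside on mforns' point at 11:12:48 — that Kafka absorbs the backlog while the mysql writers drain it at whatever speed the database allows. That is the standard pull-consumer pattern: nothing is pushed at the database; the writer only polls for more once the previous batch is committed. A hedged Java sketch of the idea, not EventLogging's actual (Python) consumer — broker, group, topic and table names are placeholders, and it uses the 0.9 consumer API that comes up later in this log:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.Properties;

// Pull-based consumer sketch: because we only poll for more events after the
// previous batch is inserted, a slow database simply slows consumption down
// while Kafka retains the backlog -- no events are dropped, only delayed.
public class ElMysqlWriterSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1012:9092");  // placeholder broker
        props.put("group.id", "mysql-m4-master-00");       // placeholder group
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                 "jdbc:mysql://m4-master/log", "user", "pass")) {  // placeholders
            consumer.subscribe(Collections.singletonList("eventlogging-valid-mixed"));
            PreparedStatement insert =
                db.prepareStatement("INSERT INTO events (json) VALUES (?)");
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : batch) {
                    insert.setString(1, record.value());
                    insert.addBatch();
                }
                insert.executeBatch();  // may be slow on cold buffers...
                consumer.commitSync();  // ...and we don't poll again until done
            }
        }
    }
}
```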
[12:04:29] (those are not related to this downtime) [12:04:34] ok [12:04:55] those were already there, and this was a requirement to solve those [12:05:22] I see [12:05:32] those are all tracked on https://phabricator.wikimedia.org/T120187 [12:16:50] jynus, I think EL is doing completely fine [12:17:07] if you don't need me, I'll take a break for lunch [12:21:10] Analytics-Tech-community-metrics, DevRel-December-2015: OwlBot seems to merge random user accounts in korma user data - https://phabricator.wikimedia.org/T119755#1880547 (Lcanasdiaz) In order to fix this issue I'm going to split this identity and execute the procedure again. I've seen this behaviour befor... [12:24:17] I agree and will do the same. If you need me I will be at #wikimedia-databases or #wikimedia-operations [13:31:43] joal: I think it's an edge case here, when the schema changes in the same camus run, we need to force camus to always use the same schema when writing avro records [13:32:06] (Abandoned) Addshore: Count references using dumps [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/257945 (owner: Addshore) [13:32:39] I've submitted https://gerrit.wikimedia.org/r/#/c/259227/, will wait for otto to merge it [13:34:46] dcausse: the new schema you were testing was already in the camus jar? [13:35:04] There is something I'm missing, I think :) [13:35:45] joal: when camus runs it reads the stream from kafka, each kafka msg contains a schema revId. We don't have problems reading messages [13:36:22] but when camus writes the big .avro file, it uses the first message schema to encode it in the file metadata at the beginning of the file [13:36:22] dcausse: camus needs to have the associated schema for the revId is what I mean [13:36:37] riiight dcausse [13:36:47] That makes sense, but my question is still valid :) [13:36:58] but if during the same camus run a new message comes with a new revId, camus won't create a new file [13:37:17] oh sorry, yes the schema is in the jar :) [13:37:48] I understand: camus tries to write avro with whatever schema it has already loaded if any, instead of checking if the schema is different [13:37:57] yes [13:38:04] ok dcausse makes sense :) [13:38:17] You had multiple schemas in that jar, how awesome :) [13:38:31] You knew what evolutions you were going to make :) [13:38:35] yes we must keep all schema revIds :/ [13:38:39] I can't plan that much :) [13:38:44] :) [14:16:28] (PS1) Addshore: Send wd-analysis ref output to graphite [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/259246 [14:17:02] (PS2) Addshore: Send wd-analysis ref output to graphite [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/259246 [14:17:38] (PS3) Addshore: Send wd-analysis ref output to graphite [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/259246 [14:17:47] (PS4) Addshore: Send wd-analysis ref output to graphite [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/259246 [14:20:44] (PS5) Addshore: Send wd-analysis ref output to graphite [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/259246 [14:24:49] (PS6) Addshore: Send wd-analysis ref output to graphite [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/259246 [14:25:34] (PS1) Addshore: Add refernece script to daily cron [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/259248 [14:25:48] spm spm spm (Sorry) [14:25:56] (CR) Addshore: [C: 2 V: 2] Send wd-analysis ref output to graphite [analytics/limn-wikidata-data] -
https://gerrit.wikimedia.org/r/259246 (owner: Addshore) [14:26:09] (CR) Addshore: [C: 2 V: 2] Add refernece script to daily cron [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/259248 (owner: Addshore) [14:33:46] (PS4) DCausse: Drop support for message without rev id in avro decoders and make latestRev mandatory [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255105 [14:57:34] ottomata: mind merging https://gerrit.wikimedia.org/r/#/c/259227/? so the next run has a chance to run. [14:58:28] (PS5) DCausse: Drop support for message without rev id in avro decoders and make latestRev mandatory [analytics/refinery/source] - https://gerrit.wikimedia.org/r/255105 [14:59:08] ah [14:59:10] you are here! [14:59:11] on it [14:59:27] thanks! :) [15:03:13] I think I've made all the possible mistakes with avro :) [15:03:36] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1880786 (Ottomata) Yeah, it would have to be a special endpoint, and it'd likely onl... [15:25:03] Analytics, CirrusSearch, Discovery, Patch-For-Review: Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version - https://phabricator.wikimedia.org/T121483#1880812 (dcausse) Open>Resolved [15:25:42] Analytics, CirrusSearch, Discovery, Patch-For-Review: Camus not reading in CirrusSearchRequestSet events with the new schema identifier and schema version - https://phabricator.wikimedia.org/T121483#1879953 (dcausse) camus automagically backfilled, I think we can re-apply the mediawiki change now. [15:34:35] Analytics-Backlog, Analytics-Cluster, Patch-For-Review: Single Kafka partition replica periodically lags - https://phabricator.wikimedia.org/T121407#1880840 (Ottomata) I've also been pointed to https://issues.apache.org/jira/browse/KAFKA-2477 as a potential cause of this problem. Unfortunately, there... [16:11:46] Analytics-EventLogging, EventBus, Wikimedia-Logstash, Patch-For-Review: eventlogging syslog message not properly recognized by logstash - https://phabricator.wikimedia.org/T120874#1880899 (Ottomata) This seems to do the trick: https://gerrit.wikimedia.org/r/#/c/259271/ [16:12:10] joal, yt? [16:13:59] running home for meetings, back shortly... [16:19:50] mforns: hola. I am going to see if anyone cares about Echo_ events, otherwise i am going to blacklist that schema [16:20:02] nuria, hi! ok [16:20:13] nuria, does it have any problems? [16:20:13] mforns: i think no one cares [16:20:57] nuria, I remember that someone wanted to keep this schema in the EL audit a couple months ago [16:22:09] nuria, the schema owner is Roan, and it has a non-default purging specification [16:22:14] I would talk with him [16:22:29] mforns: i just sent an e-mail [16:22:48] oh... ninja speed [16:24:46] dcausse: so many bumps on the avro road, they are almost blog post worthy [16:26:04] nuria: yes, mistakes are the best way to learn :) [16:26:27] dcausse: it's more like everything we thought would work out of the box ... didn't [16:27:11] nuria: the good news is that oozie is pretty awesome, it back-filled everything automagically (except a few partitions, but I'll re-check in a few hours) [16:42:08] (PS4) Bearloga: Functions for categorizing queries.
(Work In Progress) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254461 (https://phabricator.wikimedia.org/T118218) [16:43:18] (CR) jenkins-bot: [V: -1] Functions for categorizing queries. (Work In Progress) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254461 (https://phabricator.wikimedia.org/T118218) (owner: Bearloga) [16:52:53] Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1880983 (Edgars2007) OK, will get serious. I think the problem is still there, @yuvipanda. Or at least is related to this one. In last few days some 15 blank queries (those, which you get after pressing "New query") have appeare... [16:54:10] dcausse: oozie and awesome do not appear together frequently [16:54:27] a-team: need to attend technology meeting, have sent e-scrum [16:54:28] :) [16:54:35] np nuria [16:54:41] I've seen it [17:01:55] Analytics-Backlog: Guidance on using the JSON output of T44259 - https://phabricator.wikimedia.org/T121314#1881012 (Milimetric) @Mrjohncummings the output of this API is not meant to be human readable. It's meant to be used to build other tools on top of it that make it easy for humans to consume the data.... [17:02:05] Analytics-EventLogging, EventBus, Wikimedia-Logstash, Patch-For-Review: eventlogging syslog message not properly recognized by logstash - https://phabricator.wikimedia.org/T120874#1881013 (Ottomata) The above change is live in beta and applied on deployment-eventlogging04 [17:02:07] madhuvishy: standup? [17:05:32] a-team I will not be at scrum either this morning... sorry for the late notice [17:05:42] np kevinator we started [17:05:44] np kevinator [17:08:47] Analytics-Cluster, Analytics-Kanban: Peer review (with research) of methodology of last access calculations - https://phabricator.wikimedia.org/T121534#1881021 (Nuria) NEW a:Nuria [17:09:58] Analytics-Kanban: Write pageview API blogpost {melc} [21 pts] - https://phabricator.wikimedia.org/T118471#1881037 (Milimetric) [17:11:01] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Send HTTP stats about eventlogging-service to statsd [3 pts] - https://phabricator.wikimedia.org/T118869#1881038 (Milimetric) [17:11:14] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves {oryx} [3 pts] - https://phabricator.wikimedia.org/T121120#1881040 (mforns) [17:11:15] Analytics-Kanban, Patch-For-Review, Puppet: Puppet support for multiple Dashiki instances running on one server [8 pts] - https://phabricator.wikimedia.org/T120891#1881039 (Milimetric) [17:13:33] madhuvishy: so you'll just do a single dashiki dashboard, right? And I'll add the rest [17:13:40] because mobile-reportcard is still limn [17:14:00] dashiki's only running language-reportcard, edit-analysis, and vital-signs [17:14:00] milimetric: okay - i'll do the edit one then [17:14:03] k, cool [17:14:09] ottomata: https://phabricator.wikimedia.org/T121112 [17:14:27] it's in next up on our board [17:14:30] moving to done [17:14:54] oh danke [17:22:07] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1881062 (Milimetric) @Dzahn: I wasn't aware that it had changed from a VM to real hardware, where wa... 
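Back to joal and dcausse's morning exchange about schema revIds: Avro can only decode bytes with the exact schema revision that wrote them, and then resolves the result into whatever revision the reader wants. A sketch of that two-schema decode in Java; the revId-to-schema lookup is hypothetical (in the real pipeline the rev id rides in the message envelope and the schemas ship inside the camus jar):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

import java.io.IOException;
import java.util.Map;

// Hypothetical decoder: the Kafka message tells us which schema revision
// wrote the bytes; Avro needs that exact schema to parse them, and can then
// project the record into the revision we actually want to work with.
public class RevIdDecoderSketch {
    private final Map<Long, Schema> schemasByRevId;  // e.g. loaded from the jar
    private final Schema readerSchema;               // the revision we target

    public RevIdDecoderSketch(Map<Long, Schema> schemasByRevId, Schema readerSchema) {
        this.schemasByRevId = schemasByRevId;
        this.readerSchema = readerSchema;
    }

    public GenericRecord decode(long revId, byte[] body) throws IOException {
        Schema writerSchema = schemasByRevId.get(revId);
        if (writerSchema == null) {
            // This is why every revision ever used must stay in the jar.
            throw new IOException("unknown schema rev id: " + revId);
        }
        // Two-schema constructor: parse with writerSchema, then resolve into
        // readerSchema (defaults fill new fields, removed fields are dropped).
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(writerSchema, readerSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(body, null);
        return reader.read(null, decoder);
    }
}
```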
[17:23:26] Analytics-Engineering, Wikimedia-Logstash: Zookeeper logging to Logstash - https://phabricator.wikimedia.org/T84908#1881063 (Milimetric) Open>declined a:Milimetric main take-away: if it's a lot of data, it's probably not useful anyway [17:28:53] Analytics-Kanban: Remove all-access and spider from top endpoint {slug} - https://phabricator.wikimedia.org/T121300#1881089 (Milimetric) Open>Invalid I'm crazy, "agent-type" is not even a parameter on this endpoint :) [17:32:28] Analytics, MediaWiki-extensions-Gadgets: Track the GadgetUsage statistics over time - https://phabricator.wikimedia.org/T121049#1881107 (Milimetric) Sounds good to me, Lego, but then you wouldn't need us. Feel free to add us back if you do. [17:36:38] Analytics-EventLogging, EventBus, Wikimedia-Logstash, Patch-For-Review: eventlogging syslog message not properly recognized by logstash - https://phabricator.wikimedia.org/T120874#1881138 (Ottomata) Theerree we go: ``` { "_index": "logstash-2015.12.15", "_type": "syslog", "_id": "AVGmtnTKa1... [17:38:55] Analytics-Backlog, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#1881144 (Milimetric) >>! In T115119#1876036, @Beetstra wrote: >>>! In T115119#1828267, @Milimetric wrote: >>... [17:40:52] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1881171 (jcrespo) [17:40:53] Analytics-Kanban, DBA: 2 hour outage to update mysql on EL slaves {oryx} [3 pts] - https://phabricator.wikimedia.org/T121120#1881170 (jcrespo) Open>Resolved [17:49:39] Analytics: Upgrade daily/monthly aggregations of pageview dumps to new data files - https://phabricator.wikimedia.org/T90203#1881287 (Milimetric) Sorry, Erik, I misunderstood. I certainly appreciate the value-added that you mention. [17:56:16] Analytics-Backlog, Analytics-Wikimetrics, Puppet: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} - https://phabricator.wikimedia.org/T101763#1881402 (Milimetric) >>! In T101763#1864407, @yuvipanda wrote: > Hello! After every time we change any fundam... [17:58:18] Analytics-Backlog, Datasets-Webstatscollector, Language-Engineering: Investigate anomalous views to pages with replacement characters - https://phabricator.wikimedia.org/T117945#1881420 (Milimetric) Yeah, I agree, Language Engineering should have a look at this. [18:00:25] a-team: on my way to our goal meeting... be there in 5 mins [18:13:46] Analytics-Backlog: 'is_spider' column in eventlogging user agent data - https://phabricator.wikimedia.org/T121550#1881486 (Nuria) NEW [18:31:27] Analytics: kafka-tools fails to run on stat1002 - https://phabricator.wikimedia.org/T121552#1881580 (EBernhardson) NEW [18:33:29] Analytics-Backlog, Analytics-EventLogging, DBA: Drop tables MobileWebClickTracking_* from eventlogging db - https://phabricator.wikimedia.org/T120674#1881591 (jcrespo) Open>Resolved Tables dropped: ``` mysql -A -h m4-master log -e "SHOW TABLES like 'MobileWebClickTracking%'" (no results) ``` [18:33:52] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1847396 (jcrespo) MySQL at dbstore2002 has been upgraded and reconfigured, too. 
[18:34:33] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1881604 (Dzahn) >>! In T116312#1881062, @Milimetric wrote: > @Dzahn: I wasn't aware that it had chan... [18:52:40] (CR) Bearloga: Functions for identifying search engines as referers. (7 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [19:00:54] Analytics, Traffic, operations: Upgrade kafka for native TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881737 (BBlack) NEW [19:01:09] Analytics-Cluster, Traffic, operations, Patch-For-Review: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1881747 (BBlack) Open>Resolved a:BBlack [19:02:01] Analytics, Traffic, operations: Upgrade kafka for native TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881750 (Ottomata) a:Ottomata [19:02:04] Analytics, Analytics-Cluster, Traffic, operations: Upgrade kafka for native TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881752 (BBlack) [19:02:42] Analytics, Analytics-Backlog, Analytics-Cluster, Traffic, operations: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#1881753 (Ottomata) NEW a:Ottomata [19:03:08] Analytics, Analytics-Cluster, Traffic, operations: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881760 (Ottomata) [19:07:27] (PS9) Bearloga: Functions for identifying search engines as referers. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [19:09:34] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1881764 (Dzahn) [19:10:15] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1746196 (Dzahn) I re-added vm-requests. But if we actually use stat1001 for it, then it doesn't need either kin... [19:10:40] (CR) Bearloga: "recheck" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [19:11:46] Analytics-Backlog, Datasets-Webstatscollector, Language-Engineering: Investigate anomalous views to pages with replacement characters - https://phabricator.wikimedia.org/T117945#1881771 (Nikerabbit) I am afraid that at least I have no good idea what could cause this. One could be incorrectly truncated... [19:13:59] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1881789 (Joe) I think piwik will need to have its own database hosted on the very same machine, am I correct?... [19:15:15] wikimedia/mediawiki-extensions-EventLogging#517 (wmf/1.27.0-wmf.9 - b1a02ba : Tyler Cipriani): The build has errored. 
Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/b1a02ba8e9c8 [19:15:15] Build details : https://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/97054317 [19:20:44] Analytics-Cluster, Analytics-Kanban, operations, Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1881811 (Ottomata) [19:20:46] Analytics-Backlog, Analytics-Kanban, operations, Monitoring, Patch-For-Review: Turn off sqstat udp2log instance - https://phabricator.wikimedia.org/T117727#1881810 (Ottomata) Open>Resolved [19:20:54] Analytics-Cluster, Analytics-Kanban, operations, Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1881812 (Ottomata) Open>Resolved [19:20:57] Analytics-Cluster, operations, Patch-For-Review: Set up ops kafkatee instance as part of udp2log transition - https://phabricator.wikimedia.org/T96616#1881814 (Ottomata) [19:21:56] nuria: mforns The Echo schema purge guideline from the EL audit is Auto-purge recipientUserId, recipientEditCount, eventId + eventCapsule PII - which means they will get to keep the data for more than 90 days sanitized [19:22:29] so it's not true that if no one is looking at it now, they can't look at it later [19:23:09] madhuvishy: i see, thanks for looking at it. Since i *think* nobody is running numbers I wonder why the sanitization strategy is set up that way. [19:23:26] madhuvishy: where was the sanitization info? [19:23:34] nuria: hmmm we spoke to the schema owners who wanted that - it's on the talk page [19:24:05] it may be confusing to them that we are asking again - although the option of retaining data in Hadoop only did not exist then [19:24:22] nuria: https://meta.wikimedia.org/wiki/Schema_talk:Echo [19:30:49] madhuvishy: but the option of "keeping" data is contingent on it being used, right? I am not sure anyone is using this table at all. [19:31:42] nuria: of course - the assumption is that when they wanted to keep it, they are using it. but I agree if it's this huge it's hard for them. [19:32:13] madhuvishy: right, and there are several tables like that one, literally not usable [19:33:48] sorry nuria madhuvishy, I was having a snack. Agree with madhuvishy [19:35:30] a-team: Please take a look at: https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q3_Goals#Analytics [19:35:44] we could sqoop the data into hadoop (one off) and keep it there, the problem is that in hadoop, there is no partial purging like in mysql (in the near future) [19:36:03] joal: still around? [19:36:32] nuria, madhuvishy, in hadoop all events are removed after 90 days. Implementing the same purging as in mysql for hadoop is going to be quite a lot of work [19:36:42] mforns: and we shouldn't do it [19:36:46] mforns: yup [19:37:13] so, either we keep the schema, or move it to full auto-purging after 90 days [19:37:20] madhuvishy, nuria, although wait!
[19:37:27] mforns: It is sounding like auto-purging [19:37:46] mforns: yesss [19:37:48] i guess i'm saying when we reach out to teams now we should look back into our notes on what they wanted to set up based on the audit [19:37:51] if I remember right, jaime is planning to keep the events just in the replicas, not in m4-master [19:38:13] so, m4-master will remove the events after 90 days for all schemas [19:38:20] mforns: aah [19:38:21] madhuvishy: bear in mind that many of the teams we contacted in the audit no longer exist [19:38:31] mforns: ahahaha [19:38:39] and only analytics-storage will be auto-purged with the algorithms we agreed on with the owners [19:38:44] nuria: hmmm not really - the audit is fairly recent [19:38:56] the schema owner is updated on the talk page [19:39:31] madhuvishy: true but who is analyzing flow events? [19:39:33] mforns: makes sense [19:40:08] I will follow up with jaime to make sure this is possible and it is the way the purging will be executed [19:40:21] madhuvishy: in this case "flow" is listed as "inactive" [19:40:31] madhuvishy: owner is oliver [19:40:56] madhuvishy: it will be great if no events were flowing in for flow (in any schema) but I am not sure that is the case [19:41:19] nuria: i thought we were talking about Echo [19:41:34] madhuvishy: ay SORRY, it was just another example [19:41:53] Maybe we should talk with jaime first, because iirc when the auto-purging is set up, there won't be xxl tables in m4-master any more [19:42:25] mforns: yeah [19:43:05] need to leave for 5 mins, we're out of drinking water, brb [19:43:52] nuria: there's no data for the flow schema in the db [19:44:37] madhuvishy: no, but there is for FlowReplies, which is also roan's but I am not sure anyone is looking at that data either, the table is not as big though i think. [19:47:24] madhuvishy: Ideally we would only hold data that someone cares to look at, I am not sure in the case of flow and echo that is the case but you are right that the audit was not that long ago so maybe I am wrong on that. [19:48:21] nuria: yeah - i'm fine with asking them if they no longer want the data - but everyone may not know what the current setup is because we only reached out to the schema owner, so it would be good to communicate that this is the current status we have set up - and if things have changed for them - we could drop the data and change the purging strategy. [19:48:55] and we should update the talk pages for these schemas [19:49:45] madhuvishy: agreed. [19:50:36] nuria: https://docs.google.com/spreadsheets/d/1mQtbsbGHLbGsHeNaYFCdg6X4K1VSxjUStBpx4GmVbhk/edit?usp=sharing has the data from the schema audit in one place if you wanna look at it [19:51:03] madhuvishy: thank you! [19:51:14] np :) [19:54:44] back [20:01:51] hey all, coming from http://blog.wikimedia.org/2015/12/14/pageview-data-easily-accessible/. Right now it looks like the API is only backfilled to 2015-08-01; are there plans to make historical data available or will it only be forward-looking? [20:03:13] Analytics-Backlog, Analytics-EventLogging, MediaWiki-extensions-CentralNotice, Traffic, operations: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1882001 (Nuria) @tgr: Sorry but I certainly disagree that we want to allow events of *any*... [20:03:33] jurbanik: data will be available from 2015-05-01 - the backfilling jobs for may-july are paused at the moment and will resume soon.
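For anyone who wants to try the pageview API madhuvishy is describing, here is a minimal, dependency-free Java client. The per-article endpoint shape is from the public REST API announced in the blog post; the article and date range are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal pageview API call: daily user (non-spider) views for one article.
// URL shape: /per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}
public class PageviewApiExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://wikimedia.org/api/rest_v1/metrics/pageviews"
            + "/per-article/en.wikipedia/all-access/user"
            + "/Albert_Einstein/daily/2015100100/2015103100");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            // The response is a JSON object with an "items" array of
            // {project, article, granularity, timestamp, access, agent, views}.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```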
[20:05:06] no plans to make historical data older than that available yet - this is because we started collecting pageview data per the new pageview definition in May 2015. https://wikitech.wikimedia.org/wiki/Analytics/Pageviews has some info. [20:05:11] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1882023 (Nuria) @jcrespo: Can we move to the next stage of DB maintenance? I listed it above as: > See how long does it... [20:07:50] gotcha - I'm utilizing pageviews as a feature for some time series prediction use cases, and need the historical data in order to learn meaningful shapelets / patterns. [20:08:05] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1882047 (jcrespo) Yes, we can choose one of the InnoDB tables mentioned in the description. Can you stop writes to a singl... [20:08:35] should I be worried that the dumps at http://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-10/ will stop being uploaded at any point? Despite the fact that they include spiders, etc, those dumps are useful because of the historical continuity [20:13:46] jurbanik, there are no plans to cease updating the dumps you mention in the foreseeable future :] [20:21:21] mforns, madhuvishy: great, thanks for the help. Any way to learn a little about the stack being used to power the API? [20:21:55] I just spent a while building my own parser, custom storage engine, and API for the dumps, and am backfilling at a rate of ~35 days backfilled per machine day on ec2 r3xlarges. Would love to compare implementations [20:23:24] got the wind knocked out of me when I saw there was an official API, but the effort is validated since I need more historical data [20:26:44] jurbanik, about the API stack: here's a short presentation we did on it, it contains some links and a diagram. https://www.mediawiki.org/wiki/File:PageviewAPI.pdf [20:40:23] mforns: thanks Marcel! Congrats on deploying everything and keep up the good work [20:40:44] thanks jurbanik :] [21:05:34] bye a-team, see you tomorrow! [21:05:45] hasta luego [21:05:50] good night! [21:05:51] nite! [21:06:07] laters! [21:06:22] :] [22:14:14] (CR) Nuria: ">About integrating search referrers in the already existing classification, I >have a mixed opinion: do we merge the two referer classific" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [22:46:46] madhuvishy: I tested your code on my machine, and everything's fine [22:46:54] there's only one change you need: [22:46:57] https://www.irccloud.com/pastebin/GbRm6jRx/ [22:47:08] (the URLs in the tests you added weren't updated when we renamed) [22:47:09] milimetric: aaah [22:47:16] yeah must have forgotten [22:47:32] cool, so I'll go over it one more time to see if I can nitpick anything, but all looks good to me [22:47:51] milimetric: okay :) thanks! [22:48:31] we have 399 tests. Most tested and least used project we have :) [22:50:23] milimetric: :D [23:24:21] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1883271 (Milimetric) @Dzahn & @Joe: Yes, this instance will need a database. And I agree with @Joe that if any...
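Circling back to the Echo purge discussion from earlier this evening: the "partial purging" option being weighed is a column-level scrub rather than row deletion. A hedged JDBC sketch of the idea only — the event_* columns come from the guideline madhuvishy quoted, while the capsule PII columns, the timestamp format, and the connection details are assumptions that should be checked against the schema talk page before anything like this is run:

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Hedged sketch of column-level "auto-purge": rows older than 90 days are
// kept, but the agreed-upon PII columns are nulled out. Not production code.
public class PartialPurgeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://db1046/log", "user", "pass")) {  // placeholder DSN
            String sql =
                "UPDATE Echo_7572295 " +
                "SET event_recipientUserId = NULL, " +
                "    event_recipientEditCount = NULL, " +
                "    event_eventId = NULL, " +
                "    clientIp = NULL, userAgent = NULL " +  // assumed capsule PII columns
                // EL timestamps are MediaWiki-style YYYYMMDDHHMMSS strings.
                "WHERE timestamp < DATE_FORMAT(NOW() - INTERVAL 90 DAY, '%Y%m%d%H%i%s')";
            int scrubbed = db.createStatement().executeUpdate(sql);
            System.out.println("scrubbed " + scrubbed + " rows");
        }
    }
}
```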
[23:25:37] Analytics-Backlog, Datasets-Webstatscollector, Language-Engineering: Investigate anomalous views to pages with replacement characters - https://phabricator.wikimedia.org/T117945#1883272 (Milimetric) p:Normal>High [23:27:42] Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, operations, and 2 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1883286 (yuvipanda) This should probably be isolated in its own vlan or some such as well, since I strongly sus... [23:27:56] milimetric: i can delete this instance - dashiki1.analytics.eqiad.wmflabs? I'll make dashiki-01.analytics and dashiki-staging-01.analytics instead [23:29:11] madhuvishy: i think the code in the tests that does not work should be removed though [23:29:17] cc milimetric [23:29:20] madhuvishy: sure, name it something so that Yuvi doesn't hate me anymore :) [23:29:29] madhuvishy: the part with task.get() [23:29:31] milimetric: :D okay [23:29:35] nuria: all the tests worked for me [23:29:45] I think the task.get thing just fails in vagrant [23:29:50] and we all know how I feel about vagrant [23:29:54] milimetric: i commented out some parts of the controller test [23:29:57] oh [23:30:00] hmmm [23:30:14] yeah, /me feels like coding on top of quicksand is not fun [23:31:06] milimetric: yuvi says make a dashiki project [23:31:24] instead of doing it all in analytics [23:31:46] madhuvishy: I'm 100% ok with whatever he says, because I neither know nor want to know anything about infrastructure :D [23:31:49] * milimetric ducks [23:31:54] milimetric: right, the issue is not vagrant [23:31:56] :D [23:32:10] nuria: but the tests run 100% for me [23:32:42] milimetric: with the commented code [23:32:47] milimetric: on the patch [23:32:57] I hate everyone unconditionally [23:32:58] it's ok [23:32:59] milimetric: let me show ya [23:33:14] nuria: oh, ok! no, it's fine, my bad / misunderstanding [23:33:21] then yeah, remove that instead of commenting out [23:33:28] milimetric: ok, it's the async nature of those tests (which were the only ones i could not convert to be synchronous) [23:33:36] milimetric: i was giving it another go [23:33:40] milimetric: yeah okay [23:34:03] milimetric: but in the absence of me fixing the deadlock issue I think the commented code should be removed [23:36:35] nuria / madhuvishy: there was another thing we used to do which was to label tests nonDeterministic or slow and then the script wouldn't run them: https://github.com/wikimedia/analytics-wikimetrics/blob/master/scripts/test#L10 [23:36:41] nuria: alright i'll remove it [23:37:10] :) so if you like some tests you can keep them in there with that flag and they won't run as part of the normal suite [23:57:37] o/