[00:12:55] Analytics, RESTBase, Services, HyperSwitch: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2010298 (mobrovac) NEW [01:41:16] Analytics-Kanban, Editing-Analysis, Patch-For-Review, WMF-deploy-2016-02-09_(1.27.0-wmf.13): Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts] - https://phabricator.wikimedia.org/T124676#2010513 (Jdforrester-WMF) [01:41:41] Analytics-Kanban, Editing-Analysis, Patch-For-Review, WMF-deploy-2016-02-09_(1.27.0-wmf.13): Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts] - https://phabricator.wikimedia.org/T124676#1962252 (Jdforrester-WMF) This is now good to go, from 2016-... [07:11:34] (CR) KartikMistry: [C: 2] "TODO: Scripts are OK. Output of these script need to go to limn-language-data graphs." [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/268974 (https://phabricator.wikimedia.org/T108158) (owner: Amire80) [07:11:58] (Merged) jenkins-bot: Adding statistics scripts [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/268974 (https://phabricator.wikimedia.org/T108158) (owner: Amire80) [08:35:45] o/ [09:13:13] Hi elukey :) [10:30:24] the kafka1012 situation is still *bad* [10:30:25] /dev/sdb3 1.8T 1.7T 96G 95% /var/spool/kafka/b [10:30:33] /dev/sdf1 1.8T 1.7T 148G 92% /var/spool/kafka/f [10:30:35] mwarf ! [10:30:46] elukey: have talked with ottomata yesterday about that ? [10:31:43] briefly, it seems that the logrotation happens every 7 days but we started only a couple of days ago getting a huge amount of data to catch up [10:32:01] I don't get it elucker [10:32:02] so until 7 days are expired, the log keeps growing [10:32:25] kafka1012 started a few days ago, and therefore have a lot of data to catch up, right [10:32:45] but why would it use ***much*** more disk space than other brokers ? [10:34:10] this is the same thing that I was wondering [10:34:48] we agreed to wait for today to see the growth before deciding to truncate [10:35:06] ok [10:35:31] elukey: the core main space is used by kafka data, right (not logs of errors or something else) ? [10:36:33] seems only upload and text, as yesterday [10:36:51] yes, other partitions are very small in comparison [10:40:24] joal: I am going to do some work for ops in the next two hours, then I'll restart working on this [10:40:27] :) [10:40:31] k [10:40:32] :) [12:04:15] joal: I took a look to the docs and I re-thought about what Andrew said [12:05:41] the directory with a huge partition (say webrequest_upload-10) contains logs dated Feb 5, that is when we restarted the broker. That data might have been past messages that kafka1012 needed to fetch as part of its recovery. [12:06:00] the problem might be that those files were created on other brokers with an earlier timestamp [12:06:41] so could have already been deleted because their week retention time has passsed [12:06:56] meanwhile the ones on kafka 1012 will be deleted on the 12th [12:07:24] elukey: deletion policy is based on message time, right ? 
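For context, a quick way to check what the broker is actually configured to retain and how old the on-disk segments are (a sketch only; the partition directory under /var/spool/kafka/b is an assumption, the config path is the one mentioned later in the log):

    # broker-wide retention settings
    grep -E '^log\.retention\.(hours|bytes)' /etc/kafka/server.properties
    # disk usage on the kafka data partitions
    df -h /var/spool/kafka/b /var/spool/kafka/f
    # segment files for one partition; deletion is keyed off the file
    # modification time, so these dates are what the cleaner compares against
    ls -lht /var/spool/kafka/b/webrequest_upload-10 | tail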
[12:07:37] yep, file timestamp [12:09:09] at recovery time, on feb 5th, I assume kafka downloaded a lot of data from other brokers (catching up a few days) [12:09:37] creating all logs with a more recent timestamp [12:09:42] and keeping them for 7 days [12:09:43] But I wouldn't expect it to tag all of it for feb 5th [12:10:17] well it just created those files, and the unix timestamp will be the one used by the "cleaners" [12:10:19] That was my original question: does kafka delete based on message timestamp, or data reception timestamp [12:10:34] file timestamp of the log on the file system [12:10:48] -rw-r--r-- 1 kafka kafka 512M Feb 5 16:30 00000000065290382906.log [12:10:51] this one for example [12:10:58] + 7 days [12:12:02] and it makes sense since it considers all the logs as byte streams, so they should be opaque to it [12:12:12] and the only timestamp is the file system one [12:12:15] right ok, makes sense [12:12:39] at least, this is my poor understanding after chatting with the Kafka master ottomata :D [12:12:45] I would have expected kafka to be a bit smarter about getting back after failure :) [12:13:00] But it makes sense [12:13:06] yep me too! We should consider this failure scenario when setting up downtime [12:14:00] So I guess you'll truncate old files with ottomata when he arrives, and we could possibly discuss augmenting the number of partitions for big topics to mitigate that issue [12:14:33] Meaning: worst-case scenario: a broker has to have enough disk space for 14 days of data [12:15:08] If spreading webrequest upload and text across more disks, we could handle that without issue I think [12:15:12] elukey: --^ [12:15:22] 60 Feb 9 [12:15:23] 148 Feb 8 [12:15:23] 140 Feb 7 [12:15:23] 920 Feb 6 [12:15:23] 243 Feb 5 [12:15:35] this is upload-10 atm [12:15:44] number of logs + timestamp [12:16:06] grand total of ~800GB [12:17:46] joal: not sure what is the best path forward, you and Andrew have more experience than me on this so probably we'd need a discussion :) For the immediate future, Feb 5 and 6 might be good candidates for a truncation but I'd need to figure out how to delete things safely first [12:17:58] even though that data is replicated [12:18:31] but yeah Feb 6 is definitely the issue :) [12:19:05] elukey: http://stackoverflow.com/questions/16284399/purge-kafka-queue [12:19:24] elukey: It's not the exact thing we want, but we could take this idea as an example [12:19:42] elukey: basically letting kafka purge the big stuff by itself [12:20:32] could be an option, there are tons of logs probably not needed anymore [12:21:06] also elukey, IIRC, we could set an upper limit to kafka data size per topic (in addition to the time one) [12:21:29] we might be willing to consider that as well [12:22:47] yep yep [12:25:34] going out for lunch, brb! [12:45:01] back [13:12:07] Analytics-Tech-community-metrics, JavaScript: Syntax error, unrecognized expression on Korma profiles - https://phabricator.wikimedia.org/T126325#2011245 (He7d3r) [13:23:19] Analytics-Tech-community-metrics: Korma: Incorrect ticket count on top contributors page - https://phabricator.wikimedia.org/T126328#2011283 (He7d3r) NEW [13:26:01] Analytics-Tech-community-metrics, JavaScript: Syntax error, unrecognized expression on Korma profiles - https://phabricator.wikimedia.org/T126325#2011293 (Aklapper) Thanks for reporting this! (and playing with korma!) I'll merge those accounts! :) This task might be to some extend its own problem ("Synta...
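A one-liner along these lines (a sketch, same assumed partition path as above) reproduces the per-day file counts pasted at 12:15:

    # count .log segments per modification date for one partition
    ls -l --time-style=+%F /var/spool/kafka/b/webrequest_upload-10/*.log \
        | awk '{print $6}' | sort | uniq -c
    # and the rough total size of the partition directory
    du -sh /var/spool/kafka/b/webrequest_upload-10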
[13:30:10] Analytics-Tech-community-metrics, Developer-Relations, DevRel-February-2016, developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2011296 (Aklapper) a:Aklapper>Dicortazar >>! In T103292#198... [13:33:16] Analytics-Tech-community-metrics, JavaScript: Syntax error, unrecognized expression on Korma profiles - https://phabricator.wikimedia.org/T126325#2011300 (He7d3r) [13:45:48] joal: 08:41 PROBLEM - Disk space on kafka1012 is CRITICAL: DISK CRITICAL - free space: /var/spool/kafka/b 73705 MB (3% inode=99%): /var/spool/kafka/f 127290 MB (6% inode=99%) [13:54:53] ottomata: https://dpaste.de/1MVs <--- I broke down the upload/text partition logs by date for upload/text [13:56:16] o/ joal [13:56:26] will be running ~5 min late [13:56:33] See you soon :) [14:00:30] Hi halfak :) [14:00:37] np, will be there [14:03:44] so we could modify log.retention.hours=168 temporary to 4 days to remove the 5th and gain a bit of space [14:06:06] that'd be it elukey :) [14:07:05] it might be enough to do service kafka stop; edit /etc/kafka/server.properties; service kafka start [14:10:28] elukey: in a meeting, but I don't even think we need to restart kafka [14:11:17] if we contact zookeper probaly not, but if we want to modify server.properties yes [14:54:26] wikimedia/mediawiki-extensions-EventLogging#530 (wmf/1.27.0-wmf.13 - 1de03a1 : root): The build has errored. [14:54:26] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/1de03a1d6b6f [14:54:26] Build details : https://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/108041097 [15:16:38] milimetric: heyyYyy [15:16:42] should we do our thang? [15:17:05] ok, lemme get the build [15:17:15] (CR) Ottomata: [C: 1] "+1, but I don't have much context, so probably someone else should take a look." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/268639 (https://phabricator.wikimedia.org/T125960) (owner: Joal) [15:17:18] k [15:17:31] ottomata: we should take some time for kafka with elukey [15:17:38] Ho, and HI ! :) [15:17:43] oh for dat disk [15:17:43] ja [15:17:45] hallloooo! [15:18:32] haaaalllooo [15:19:35] ottomata: I took a look to the logs layout on the disk, thanks for the hint yesterday :) [15:19:55] oh? [15:20:19] I finally got the ctime/mtime thing [15:21:24] ah ok great [15:21:42] also https://dpaste.de/1MVs is interesting if you haven't seen it [15:22:17] ah cool, yeah makes sense [15:22:21] bout the same for sdb i assume [15:22:25] its pretty full too [15:23:28] elukey: what if on this broker we should temporarily set log.retention.hours=24*2 [15:23:30] and restarted the boker [15:23:31] broker [15:23:49] i thikn it would delete those files from feb 5 and 6 [15:23:52] and free up lots of space [15:24:42] there are per topic settings that we can set on the fly [15:24:54] but, i think that would change the retention policy for all brokers [15:25:22] i also think that it might be fine to stop the broker and manually delete old files [15:25:28] but i would prefer to let kafka do it [15:26:19] ok ottomata, all good: https://gerrit.wikimedia.org/r/#/c/269420/ [15:26:32] deploying can happen, I'm on the three servers and will test locally and remotely [15:26:50] oook, so aqs deploy, will do check then deploy on all 3, ja? [15:26:56] yes [15:27:04] and we'll have to see if the local changes are cleanly wiped [15:27:39] what file was that again? 
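On the per-topic settings that can be changed on the fly (the idea behind the stackoverflow link at 12:19): a hedged sketch of what that would look like, assuming a Kafka release where kafka-topics.sh still accepts per-topic config overrides (newer versions moved this to kafka-configs.sh); ZOOKEEPER_HOST is a placeholder:

    # temporarily drop retention for one heavy topic so the broker purges it itself
    kafka-topics.sh --zookeeper ZOOKEEPER_HOST:2181 --alter \
        --topic webrequest_upload --config retention.ms=86400000
    # once the cleaner has deleted the old segments, remove the override so the
    # topic falls back to the broker default (the exact flag name for deleting a
    # topic config override varies by Kafka version)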
[15:28:13] projects/aqs_default.yaml [15:28:30] but the new code has the same change [15:28:53] k [15:28:57] check looks good [15:28:59] proceeding with deploy [15:29:41] ahhh [15:29:42] msg: Failed to init/update submodules: error: Your local changes to the following files would be overwritten by checkout: [15:29:42] projects/aqs_default.yaml [15:29:45] cool [15:29:47] that's pretty cool [15:29:56] i will revert locally on each [15:30:28] ok deploying again [15:30:31] cool [15:30:36] yeah, it's good that it stops us [15:30:40] ya for sure [15:31:18] ok all 3 look good [15:31:26] k, testing [15:32:05] ottomata: yes I wanted to modify /etc/kafka/server.properties with 24*4 days initially to remove only Feb 5th as test [15:32:12] with service kafka restart [15:32:17] aye [15:32:29] elukey: if you want to proceed, i'll back you up [15:32:29] all right, sending a code review and then proceeding [15:32:35] ah, naw [15:32:40] don't bother with commit on this one [15:32:45] we'll stop puppet [15:32:48] nono unrelated CR [15:32:49] :) [15:32:49] make the change, restart kafka [15:32:52] ohoho [15:32:52] ok [15:32:58] hm, this doesn't exist: https://en.wikipedia.org/wiki/Superb_Owl [15:33:06] haha [15:33:10] that is a shame! [15:33:12] i kno [15:33:14] what an opportunity! [15:33:18] i kno! [15:33:22] ok, all's good tho [15:33:25] tests all pass [15:33:30] ok great [15:33:42] oddly the local tests are taking longer than usual now, but just a bit longer, probably fine [15:35:46] fail :/ https://en.wikipedia.org/wiki/Wikipedia:Article_wizard/Neologism [15:37:11] omg, I'm out, no wonder we don't have editors: https://en.wikipedia.org/w/index.php?title=Superb_owl&action=history [15:38:16] "No such subspecies so redirecting to the order" !!!! [15:40:10] :) [15:41:14] !log kafka broker restarted - kafka1012 [15:49:01] ottomata: shall we try with three days? [15:49:15] hmmm [15:49:28] it didn't really clear up any space, did it? [15:50:17] nope from what I can see, but 4 days gets us to Feb 5th and maybe timestamp are still "valid" [15:50:30] ja [15:50:31] looking at that [15:50:44] with 3 days we'd be sure [15:50:57] elukey: i think you are right [15:51:03] an old one i see is [15:51:03] 512M Feb 5 16:42 00000000048959000755.log [15:51:11] current time is 15:50 [15:51:26] joal, I've edited https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Sanitization. Can you check that I did OK? [15:51:31] elukey: let's do 2 days, most of the data is in feb 6 [15:51:32] I also posted some notes on the talk page. [15:51:48] ottomata: all right! moving the retention to 48 hours [15:51:51] k [15:51:56] And filed a task to add to wikitech -- https://phabricator.wikimedia.org/T126338 [15:53:26] ottomata: restarted! [15:53:41] !log restarted kafka1012 with 48hrs of log retention [15:53:54] lots of deletions looks like! :) [15:54:48] yesssssssss [15:55:05] oh yaa there it goes [15:55:30] seeing recovery with df -h [15:55:49] /dev/sdf1 1.8T 1021G 813G 56% /var/spool/kafka/f [15:56:43] ja [15:57:19] re-enabling puppet [15:57:37] joal: https://etherpad.wikimedia.org/p/analytics-notes [15:57:45] and setting back retention to the original value [15:57:46] at the top of that I put my draft for the email I want to send on Thursday [15:58:02] (the reason for sending Thursday is because I need it to make the opening joke work) [15:58:08] ok elukey deleltions look good [15:58:08] ja [15:58:09] continue [15:58:10] JDD - Joke driven development [16:00:13] * joal likes JDD ! 
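For the Kafka admin notes, a condensed sketch of the procedure that was just run on kafka1012 (commands reconstructed from the conversation above, not copied from the host):

    # keep puppet from reverting the manual edit while we work
    puppet agent --disable 'temporary log retention change on kafka1012'
    # in /etc/kafka/server.properties, temporarily lower the broker-wide value:
    #   log.retention.hours=168  ->  log.retention.hours=48
    service kafka restart
    # watch the old segments being deleted and the space coming back
    df -h /var/spool/kafka/b /var/spool/kafka/f
    # afterwards: restore log.retention.hours=168, restart kafka once more,
    # then re-enable puppet
    puppet agent --enable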
[16:00:18] milimetric: +1 :) [16:00:24] Maybe some next steps ? [16:00:26] ottomata: all done, kafka restarted with 168 hrs retention limit [16:00:43] nice [16:00:45] elukey, ottomata : Thanks a million guys for having handled that kafka thing :) [16:01:11] joal: I know what the technical next steps are, but I wanted to hold off to see if anyone disagreed. I'll mention that I'll send next steps if everyone agrees [16:01:25] joal: I am going to write a note in the Kafka admin page about how to purge logs! [16:01:32] awesome milimetric :) [16:02:23] halfak: Thanks for the edits and comments [16:02:36] No problem. Sorry I didn't get farther. [16:02:54] * halfak loves the "please read this", "Oh! I'll edit while I read" pattern. [16:09:17] joal: reading more flink blog stuff [16:09:19] it looks really awesome [16:09:40] ottomata: yeah, I agree :) [16:12:25] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka/Administration#Purge_broker_logs [16:13:24] nice gonna add something [16:15:45] elukey: in your investigations did you notice if it was ctime or mtime? [16:23:38] ottomata: can you send a link with the flink articles you read ? [16:25:10] ottomata: I think that both are the same [16:25:19] sorry, both have the same value [16:25:32] because the log file doesn't get touched [16:25:36] no? [16:26:25] ja it is appended to [16:26:29] until it is rotated [16:26:40] ja joal just read this [16:26:41] http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/ [16:26:43] savepoins [16:26:47] savepoints [16:26:52] in 1.0, which is not released yet [16:26:54] right: stateful streaming ! [16:27:05] sounds great, heh :) [16:30:56] halfak: how should I answer your comments on the talk page (I actually have never done that, can you believe it ) [16:31:04] Analytics, Analytics-EventLogging, EventBus: Make eventlogging-service work in MW Vagrant - https://phabricator.wikimedia.org/T126346#2011724 (Ottomata) NEW a:Ottomata [16:31:12] joal, sure! [16:31:33] halfak: I modify the source, adding a portion for instance ? [16:58:00] ottomata: whenever you have time can you check that https://gerrit.wikimedia.org/r/#/c/268682/14 is ok? [17:04:25] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Add autoincrement id to EventLogging MySQL tables. {oryx} [8 pts] - https://phabricator.wikimedia.org/T125135#2011821 (Milimetric) a:madhuvishy [17:12:45] Analytics-Kanban: Add pivot parameter to tabular layout graphs {crow} [? pts] - https://phabricator.wikimedia.org/T126279#2011843 (Milimetric) p:Triage>Normal [17:14:35] elukey: cool +1 on the burrow template change [17:14:42] merge away! [17:15:53] * elukey is happy [17:18:34] Analytics-Kanban: Add pivot parameter to tabular layout graphs {lama} [? pts] - https://phabricator.wikimedia.org/T126279#2011860 (Milimetric) [17:18:38] Analytics-Kanban: Improve the data format of the browser report {lama} - https://phabricator.wikimedia.org/T126282#2011861 (Milimetric) p:Triage>Normal [17:26:45] mforns: have you seen this? https://tools.wmflabs.org/musikanimal/pageviews [17:26:55] milimetric, yes! [17:27:42] :) I'm glad everyone's working on this client thing, I think people just have a lot of fun doing that apparently [17:27:47] milimetric, it is awesome [17:29:44] heh "I had originally said this was not meant to be a long-term solution, but I take that back. There are many more features coming!"
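On the ctime/mtime question: a quick way to see both timestamps on a rolled segment (a sketch; the file name is the one from the listing at 12:10, the path is assumed):

    # mtime (%y) and ctime (%z) should match once kafka has rolled the segment,
    # since nothing appends to or renames the file after that
    stat --format='mtime: %y   ctime: %z   %n' \
        /var/spool/kafka/b/webrequest_upload-10/00000000065290382906.log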
[17:30:00] boy that wraps up computer science in one quote :) [17:30:15] hehe [17:34:24] Analytics-Kanban: camus-wediawiki job should run in production (or essential?) queue {hawk} [1 pts] - https://phabricator.wikimedia.org/T125967#2011913 (JAllemandou) [17:34:34] Analytics-Kanban: Use a new approach to compute monthly top 1000 articles (brute force probably works) {slug} [8 pts] - https://phabricator.wikimedia.org/T120113#2011914 (JAllemandou) [17:35:00] ottomata: currently doing that: https://phabricator.wikimedia.org/T125967 [17:35:17] I suggest to do it jointly for mediawiki and eventlogging - Your opinion ? [17:35:20] ottomata: --^ [17:36:40] hmm, joal i'm not sure [17:36:48] maybe prod, prob not essential [17:37:04] ottomata: ? [17:37:26] Concerned about resource prehemption? [17:38:31] a bit, i am not sure if is 'essential' :) [17:39:03] ottomata: For MediaWiki, pretty sure: Anlytics-Search has prod jobs (of their own), but prod nonetheless :) [17:39:14] For eventlogging: agreed :) [17:39:31] joal: i am not too opinionated about it [17:39:35] so whatever you think is fine [17:40:08] I prefer to have it in essential, since any prod job depends on it [17:40:15] I'll do that for MW only :) [17:43:42] CR posted ottomata :) [17:44:34] k cool [17:44:57] Analytics, Research consulting, Research-and-Data: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2011940 (leila) Update: Erik and I met today and discussed this in more details. We will continue working on this card, and update this card with poi... [17:49:07] Analytics, Analytics-EventLogging, EventBus: Make eventlogging-service work in MW Vagrant - https://phabricator.wikimedia.org/T126346#2011946 (mobrovac) [17:54:32] Analytics, Research consulting, Research-and-Data: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2011984 (leila) [17:59:08] a-team: staff is on batcave [17:59:22] nuria: yeah I updated the link [18:00:05] ottomata: coming to stafff??? [18:24:09] *scratches head* hmm, ottomata, just looking at making this draft puppet module now, I guess I should make a user and have my cron run under that user (as with in the wdbuilder module) [18:24:22] where should this system users home directory be? [18:28:49] addshore: hm not sure [18:29:04] /var/lib/wmde-analytics ? xD [18:29:39] I guess putting it in home wouldn't be the worst thing, I doubt there is going to be a real user called wmde-analytics... [18:30:42] ja that is probably fine. addshore is this user maybe similar to the analytics-search user we added for discovery recently? [18:31:00] will this user be used to own files and do things in hadoop? [18:31:04] oooh probably, *searches in puppet for that* [18:31:15] ottomata: yeh, own files and run a cron [18:31:20] in hdfs? [18:31:35] hmm, no it wont actually own anything in hdfs at this stage [18:31:57] addshore: is this tied to stat1002/3 type box usage? [18:32:10] yup, most of it is tied to stat1002 [18:32:15] or coudl it run anywhere that has access to research dbs? [18:32:22] why stat1002? logs in /a? [18:32:36] some of it just needs the research dbs actually, othe bits need the logs and other bits need hive [18:32:51] ah it will run hive jobs? [18:32:52] this user? [18:33:04] yus! (at least 1) [18:33:29] ok, then yeah, i think you can/should model after analytics-search user. 
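As a plain-shell illustration of the system user being discussed here (a sketch only; the real module would use a puppet user resource modeled on analytics-search, and the wmde-analytics name and /var/lib home are just the values floated above, not decided):

    # hypothetical equivalent of what the puppet module would manage
    useradd --system --create-home --home-dir /var/lib/wmde-analytics \
        --shell /bin/false --comment 'WMDE analytics cron user' wmde-analytics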
[18:33:41] you might want to parameterize your new puppet module to accept a user to use for stuff [18:33:48] and pass it in, but not manage the creation of it in that module [18:33:53] okay! :) [18:42:15] a-team, except if there is anything for which I'm needed, I'm gonna get some diner :) [18:43:33] going to log off too, byyyyeeeeee o/ [18:44:33] Analytics: Migrate the simplest limn dashboards to dashiki tabular {frog} - https://phabricator.wikimedia.org/T126358#2012091 (Milimetric) NEW [18:44:59] nuria: ^ listed out the different dashboards and what to do with each [18:45:28] * milimetric lunching [18:52:07] ottomata: one last thing! hhvm has been upgraded on all mw* servers, and kafka1012 is fine [18:52:20] so I am planning to start the reboot of the brokers for the kernel updates [18:52:27] (tomorrow morning) [18:52:44] I'll ping Marcel and Joseph beforehand [18:52:51] restarting one kafka node at the time [18:53:14] elukey, cool [18:53:25] this time I also have access to the console so I'll be able to check and fix if the host doesn't boot [18:53:44] ok! [18:53:46] elukey: sounds good [18:53:46] danke! [18:53:57] remember that 1012 is unbalanced [18:54:03] haven't run an election since [18:54:12] * ottomata lunching [18:55:22] all right I'll run the election first [19:21:07] ottomata: I added you to a really rough draft at https://gerrit.wikimedia.org/r/#/c/269467/, would be great to get some random comments on it from you! [19:25:35] milimetric: nice, will file ticket for EL IP removal now [19:30:18] Analytics: Develop Verify Merge scripts into Data Warehouse repo {mole} - https://phabricator.wikimedia.org/T88641#2012263 (Nuria) Open>declined [19:33:24] nuria: I tested the autoincrement change on beta and it mostly looks fine - except there are gaps in the autoincrement ids - they are in increasing order but i see big gaps - have you seen this behavior before? [19:34:29] addshore: commented [19:35:28] madhuvishy: gaps in autoincrement....mmm.. no [19:35:46] nuria: ah hmmm okay let me see if i can figure out why [19:36:36] madhuvishy: check engine on tables, should be tokudb [19:36:47] madhuvishy: let me give you select [19:37:06] nuria: okay - it is beta cluster so i assume it is - but sure [19:37:24] madhuvishy: SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 as `TOTAL SIZE (GB)`, ENGINE, CREATE_OPTIONS FROM information_schema.tables WHERE TABLE_SCHEMA='log' /* AND `ENGINE` <> 'TokuDB' */ ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC LIMIT 30; [19:37:38] madhuvishy: we had to do some hand work to put tokudb there [19:38:10] madhuvishy: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/TestingOnBetaCluster#Database [19:38:34] nuria: it is TokuDB - not all the tables - but the one I'm looking at is [19:38:39] madhuvishy: k [19:38:43] madhuvishy: that makes sense [19:38:58] because it was newly created [19:39:38] should puppetize that ^^ [19:39:42] or at least edit the my.cnf and add it [19:39:47] --default-storage-engine=tokudb [19:39:52] shouldn't have to remember to pass that on start up [19:40:42] ottomata: but the thing is that i do not even think the mysql install on that node is on puppet - or rather I do not know how to find it- [19:41:40] heh, dunno either [19:41:46] nuria: if not, then just edit my.cnf manually [19:42:28] ottomata: on mysql restarts right? 
i just edited it [19:42:38] ja [19:42:39] danke [19:42:45] ottomata: but [19:42:48] https://www.irccloud.com/pastebin/k1pUQsrn/ [19:43:00] sudo madhuvishy [19:43:05] :) [19:43:05] aahhh [19:43:25] why would it not say that [19:43:26] https://xkcd.com/149/ [19:43:51] :D [19:44:12] i dont know how to verify it started with the tokudb engine though [19:44:21] Analytics, Analytics-EventLogging: Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default - https://phabricator.wikimedia.org/T126366#2012321 (Nuria) NEW [19:44:38] Analytics, Analytics-EventLogging: Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole} - https://phabricator.wikimedia.org/T126366#2012328 (Nuria) p:Triage>High [19:45:25] madhuvishy: let me see "show engine?" [19:47:02] madhuvishy: https://www.percona.com/blog/2011/11/29/avoiding-auto-increment-holes-on-innodb-with-insert-ignore/ [19:47:13] TokuDB | DEFAULT | Tokutek TokuDB Storage Engine with Fractal Tree(tm) Technology | YES | YES | YES [19:47:34] show engines [19:48:24] milimetric: I cannot change standup time , if you are the owner can you make the change? [19:51:07] madhuvishy: looks like it is a documented bug: http://bugs.mysql.com/bug.php?id=63128 [19:51:15] nuria: read this blog - that solution doesn't seem wise for us - what do you think? We can ignore the gaps - after all we just need them to be increasing in number [19:51:25] madhuvishy: no action needed [19:51:29] we should be careful to not use them as counts [19:51:31] that is all [19:51:46] madhuvishy: cause for us the important thing is that they increment so we can use them for replication purposes [19:51:54] yuupp [19:52:32] madhuvishy: once you are sure mysql coperates with autoincrement, let's deploy new code , make sure it works and notify jaime [19:52:35] nuria: okay then - i'll test this a bit more - but i think it's good to go [19:52:44] cool [19:59:08] Analytics: Research Spike: Provide data on about top pageviews across all projects daily - https://phabricator.wikimedia.org/T126367#2012359 (Nuria) NEW [20:00:16] Analytics, MediaWiki-extensions-ContentTranslation, operations: Make the command `sql wikishared` work on terbium like `sql enwiki`, `sql centralauth`, etc. - https://phabricator.wikimedia.org/T122474#2012373 (Nuria) @Amire80 closing as data is been gathered on 1002 now [20:07:44] Analytics-Kanban, Patch-For-Review: Lower parallelization on EventLogging to 1 consumer {oryx} [3 pts] - https://phabricator.wikimedia.org/T125225#2012410 (Nuria) Open>Resolved [20:08:04] Analytics-Kanban, Patch-For-Review: Modify wikimetrics local install to account for recent changes to puppet to get rid of the self-hosted puppet master {dove} [13 pts] - https://phabricator.wikimedia.org/T123749#2012415 (Nuria) Open>Resolved [20:09:03] Analytics-Kanban, RESTBase, Patch-For-Review: Update AQS config to new config format {melc} [5 pts] - https://phabricator.wikimedia.org/T122249#2012421 (Nuria) Open>Resolved [20:09:20] Analytics-Kanban: Bookmark-able graphs in Dashiki tabular layout [3 pts] {lama} - https://phabricator.wikimedia.org/T124298#2012428 (Nuria) Open>Resolved [20:09:44] Analytics-Kanban: Separate dashiki staging and prod hiera configs [1 pts] - https://phabricator.wikimedia.org/T126076#2012434 (Nuria) Open>Resolved [20:17:43] (PS1) Ppchelko: Removed RESTBase-specific code from AQS repository. 
[analytics/query-service] - https://gerrit.wikimedia.org/r/269477 (https://phabricator.wikimedia.org/T126294) [20:46:49] ottomata: where can I see camus logs? troubleshooting a problem, no data since Feb 3 13h-14h UTC [20:47:13] Analytics-Kanban, Patch-For-Review: Dashiki textual visualization [5 pts] {lama} - https://phabricator.wikimedia.org/T124297#2012682 (Nuria) Open>Resolved [20:47:23] ah they are on analytics1027 dcausse, [20:47:26] don't think you hae access [20:47:27] looking [20:47:45] Analytics-Kanban: Eventlogging should start with one bad kafka broker, retest that is the case {oryx} [5 pts] - https://phabricator.wikimedia.org/T125228#2012685 (Nuria) Can we add the upgrade ticket to this task? [20:47:56] 16/02/09 20:10:34 ERROR kafka.CamusJob: Error for EtlKey [topic=mediawiki_CirrusSearchRequestSet partition=3leaderId= server= service= beginOffset=794615265 offset=794615266 msgSize=300 server= checksum=860363735 time=1455045329229 message.size=300]: java.lang.RuntimeException: null record [20:47:59] I still don't know where's problem, mediawiki or camus, I'm currently fetching data from kafka to check [20:48:09] damn [20:48:24] Analytics-Kanban, Learning-and-Evaluation, Patch-For-Review: Add instruction text next to the input fields in the Program Global Metrics Report {kudu} [1 pts] - https://phabricator.wikimedia.org/T121899#2012693 (Nuria) Open>Resolved [20:48:31] dcausse: maybe i shoudl revert the snappy change? [20:48:44] Analytics-Kanban, Patch-For-Review: Reorganize oozie jobs to not use mobile cache webrequest_source {hawk} [21 pts] - https://phabricator.wikimedia.org/T122651#2012694 (Nuria) Open>Resolved [20:48:51] not sure it's related, because the problem started on Feb 3 [20:49:02] Analytics-Kanban, Analytics-Wikimetrics, Patch-For-Review: Include all timezones in global metrics report interface {kudu} [3 pts] - https://phabricator.wikimedia.org/T121167#2012696 (Nuria) Open>Resolved [20:50:56] oh ok [20:50:57] hm [20:50:57] right [20:51:02] running camus on my side to see [20:51:16] do you have ton of these errors? [20:52:09] yes, getting you yarn logs from last run... [20:52:41] ottomata, dcausse i have seen that error before and it was a red herring [20:53:09] ottomata: I think the thing to look for in the logs is: ERROR kafka.CamusJob: job failed: 100.0% messages skipped due to other, maximum allowed is 0.1% [20:53:57] nuria: ok [20:54:21] dcausse: we should check what is on topic just in case data is bad but I think that error might be missleading [20:55:00] nuria: I'm running camus to see, it's very slow so data may still be here... [20:55:22] dcausse: data should be there right? cause we keep days of it [20:55:22] I hope records are not broken :/ [20:55:48] dcausse: stat1002 /tmp/mediawiki-camus.log [20:56:08] ottomata: thanks [20:56:14] cc ottomata . Don't we keep like 7 days of buffer for data in kafka or ... i am ... making it up on my mind? [20:56:22] yes [20:56:24] that is correct [20:57:00] nuria: job failed: 100.0% messages skipped due to other, maximum allowed is 0.1% [20:57:02] :( [20:57:13] dcausse: ok, now that is a problem [20:57:47] dcausse: cause it means it exited w/o having process any data [20:58:03] yes so something wrong in the data :/ [20:58:07] dcausse: I would use kafkatee to see if things look ok ( data is there we keep 7 days) [20:58:29] nuria: ok will check [20:58:35] dcausse: and not only that , logs are TERRIBLE [20:58:52] dcausse: so painful... are you set up to run camus in 1002? 
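Since Camus runs as a YARN job, the errors quoted here live in the aggregated YARN logs; a sketch of pulling them out (the application id shown is the one ottomata references just below, at 21:00):

    # fetch the aggregated logs for one camus run and look for the failure
    yarn logs -applicationId application_1454006297742_34085 \
        | grep -E 'messages skipped|ExceptionWritable|null record' | head -n 40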
[20:58:59] yes [20:59:08] nuria: means kafkacat [20:59:24] ottomata: yesssir kafkacat [20:59:42] will maybe use kafkacat because I'm not sure I can see all errors, I mean if something wrong happens in the mappers [21:00:01] hmm, reading an event out with kafkacat and parsing it with avro-tools-1.7.7.jar fragtojson seems to work [21:00:28] dcausse: i've got you a bit more logs in /tmp/mediawiki-camus-application_1454006297742_34085.log [21:00:31] it was getting really large [21:00:49] so i had to grep out some repeated messages [21:00:54] and also ctrl-c'ed cause it was taking a while [21:01:00] ebernhardson: mmmmm... schemas? are we using the right schemas [21:01:15] 2016-02-09 19:15:28,930 INFO [main] org.wikimedia.analytics.refinery.camus.coders.AvroBinaryMessageDecoder: Underlying schema registry for topic: mediawiki_CirrusSearchRequestSet is: org.wikimedia.analytics.refinery.camus.schemaregistry.KafkaTopicSchemaRegistry@477e697c [21:01:20] nuria: i double checked the header and it reports the correct magic byte and schema number [21:01:33] 2016-02-09 19:15:28,946 WARN [main] com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputRecordWriter: ExceptionWritable key: topic=mediawiki_CirrusSearchRequestSet partition=4leaderId=18 server= service= beginOffset=794647097 offset=794647098 msgSize=231 server= checksum=3408991578 time=1455045328933 message.size=231 value: java.lang.RuntimeException: null record [21:01:33] at com.linkedin.camus.etl.kafka.mapred.EtlRecordReader.nextKeyValue(EtlRecordReader.java:295) [21:01:34] ... [21:01:36] basically i used the process i documented here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/MediaWiki_Avro_Logging#Test_reading_an_Avro_record_from_a_Kafka_topic [21:02:18] i only read a single value out though, so possibly we are sending null records at some point hmm [21:02:26] ebernhardson: you guys are such winners .. thanks for documenting [21:02:56] ebernhardson: mmmm... with "some" records that are not valid the parser might exit, that is configurable on properties [21:04:56] ebernhardson: nevermind, it is the "max exceptions to print" [21:05:24] ebernhardson: I would try to decode from the offsets: beginOffset=794647097 offset=794647098 [21:07:27] ok lemme figure out the right kafkacat arguments, i know they have some options in there for that [21:07:33] ebernhardson: do more than 1 record and let me know if you need help [21:08:54] nuria: how long does it take for a new revision's table to be created? [21:08:59] on EL [21:09:14] nuria: unfortunately my current test setup can only handle 1 record at a time :( i'll put on my list to put together a small consumer that can do more than one at a time [21:09:24] madhuvishy: depends on the batch size and timing , let me see cause mforns changed those settings [21:09:44] madhuvishy: also depends on replication, at the master it should be visible as soon as it is created [21:10:08] nuria: is there a master for beta cluster?
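A small complement to the header check ebernhardson mentions: dump the first bytes of one message to eyeball the framing (a sketch; the 9-byte header, one magic byte plus an 8-byte schema id, is inferred from the dd skip=9 in the command dcausse pastes just below at 21:17, and the broker address is the one used there):

    # show the framing bytes of a single message at a known offset
    kafkacat -b 10.64.5.13:9092 -t mediawiki_CirrusSearchRequestSet \
        -o 794647098 -c 1 | head -c 9 | xxd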
[21:10:21] madhuvishy: ah no, there it should take just couple mins [21:11:28] nuria: hmm it is taking longer - yesterday i inserted events to the Analytics schema - there was no table for it in the db - there were no errors - I could see the events validated in kafka - i left it for a while and came back and saw that the table existed [21:11:30] madhuvishy: see changes, should take 300 secs [21:11:32] https://github.com/wikimedia/eventlogging/commit/3c61014e8e9dab8839f3c52141d71923b41229b6 [21:11:41] right 5 minutes [21:12:05] i changed the Analytics schema today and inserted events to the new revision [21:12:32] its been a while definitely more than 5 minutes - but no table yet [21:12:35] ebernhardson: you can look at /home/nuria/avro-kafka/README/launch_camus_job_no_wrapper.sh [21:12:42] events are validated, are in files, kafka etc [21:12:45] ebernhardson: if you want to "mimic" 1 camus run [21:12:54] ebernhardson: you need properties file and build jars [21:13:12] madhuvishy: in beta labs? [21:13:15] yes [21:13:26] yo don't need to build jars [21:13:26] madhuvishy: do you see your events in all-events.log? [21:13:30] you can use the jars on stat1002 [21:13:30] yup [21:13:36] but you do need a custom properties file [21:13:40] ottomata: ah true, cause you do not have new changes [21:13:42] dcausse: is doing it [21:14:53] madhuvishy: man .....what else? let me log in into beta [21:15:17] nuria: sure - only thing was i restarted with that tokudb engine default in my.cnf [21:15:31] madhuvishy: before you sent events right? [21:15:37] nuria: yes yes [21:17:00] ebernhardson: kafkacat -t mediawiki_CirrusSearchRequestSet -o 794647098 -c 1 -b 10.64.5.13:9092 | dd iflag=skip_bytes skip=9 | java -jar ~ebernhardson/avro-tools-1.8.0.jar fragtojson --schema-file ~ebernhardson/CirrusSearchRequestSet_111448028943.avsc - [21:17:30] I can see the record but I have an EOF maybe it's normal [21:18:01] dcausse: the EOF is normal, i'm not sure what avro-tools expects at the end but it was always there [21:18:08] ok [21:18:20] madhuvishy: so this one has records as of recent: [21:18:24] https://www.irccloud.com/pastebin/7A2lndW0/ [21:18:29] will check with another record... [21:18:46] nuria: yes - but this is not the latest revision [21:19:00] i was seeing if insertion still worked to the old one [21:19:15] nuria: Revision 15332607 is latest [21:19:17] madhuvishy: ok, so i guess insertion is working [21:19:22] yes [21:19:39] nuria: kafkacat -b deployment-kafka02eployment-prep.eqiad.wmflabs:9092 -t eventlogging_Analytics -o -10 [21:19:39] data seems sane :/ [21:19:49] nuria: you can see the new revision events [21:20:22] can we see mysql insertion logs somewhere? [21:21:18] dcausse: your camus run works fine? [21:21:35] your test run? [21:21:37] ottomata: well it was running until now :/ [21:21:45] ? [21:21:47] Error: java.io.IOException: Failing write. Tried pipeline recovery 5 times without success. [21:22:06] but still running... [21:22:10] hm [21:23:14] o/ joal. Still around? [21:23:26] madhuvishy: so mysql log has no"15332607" [21:23:33] nuria: yess [21:23:33] madhuvishy: I would add logging to table creation [21:23:42] nuria: ok trying [21:23:45] madhuvishy: to see whether something is off [21:23:56] madhuvishy: remember to reinstall/stop/start [21:24:01] yup [21:24:07] madhuvishy: k let me know [21:24:49] nuria: where do these logs go? 
[21:25:07] ottomata: I have data [21:25:31] madhuvishy: to root@deployment-eventlogging03:/var/log/upstart# more eventlogging_consumer-mysql-m4-master.log | grep 15332607 [21:25:38] nuria: ah ok [21:25:40] madhuvishy: upstart manages them [21:27:04] bye a-team! cya tomorrow [21:27:15] good night, Marcel [21:27:23] :] [21:27:24] ottomata: data I have: hdfs dfs -ls /user/dcausse/camus/raw/mediawiki/mediawiki_CirrusSearchRequestSet/hourly/2016/02/03/14 [21:27:25] dcausse: ? [21:27:32] but it is not in prod [21:27:33] that is from an import you just did? [21:27:36] with camus [21:27:38] yes [21:27:39] with snappy setting? [21:27:51] yes [21:28:24] interesting [21:28:28] I've run camus with my own jar... [21:28:32] dcausse: are you using camus jar in /srv/deployment/analytics/refinery? [21:28:33] ah [21:28:34] yeah [21:28:44] we did a deploy feb 3 [21:28:47] of refinery [21:28:59] which has a new camus jar [21:29:04] hmmMMMmM [21:29:04] but [21:29:08] well [21:29:09] not a new jar [21:29:11] just a new version [21:29:14] shouldn't be any changes, not sure though [21:29:15] hm [21:29:21] let me check [21:29:21] looking [21:29:32] hmm [21:29:42] camus jar is the same [21:29:46] refinery-camus is a new version [21:30:43] dcausse: can you paste the command you are running to launch your camus? [21:31:03] /srv/deployment/analytics/refinery/bin/camus --run --job-name camus-avro-test-dcausse-2 -l ./refinery-camus-0.0.23-SNAPSHOT.jar mediawiki.properties --check &> camus-run-snappy.log [21:31:32] so it's an old jar [21:31:40] ja, but clearly something broke with the new one. [21:31:45] even though i don't see any changes in logs [21:32:01] heheh, i think the cron job should link against a version. lemme just change it [21:32:23] just like oozie, we shouldn't change the jars out from under the thing we know works, even if we build a new version [21:33:37] milimetric: this is how you get cache headers on api requests: https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=20&smaxage=20 [21:33:43] dcausse: going to launch a prod camus run with that refinery-camus 0.23 [21:33:48] 0.0.23 [21:33:51] see if it is ok [21:33:54] ottomata: thanks [21:33:58] adding the maxage parameter! actually pretty smart, i like that [21:34:08] nuria: weirdddd [21:34:10] looking at refinery-camus but nothing obviously wrong :/ [21:34:24] it's like the opposite of what people usually do (cache-bust) but I love it [21:34:27] makes a lot more sense [21:34:44] milimetric: super smart actually, i have never used that in any api i designed before [21:34:51] man, that should be in the HTTP spec :) [21:35:42] "just like oozie, we shouldn't change the jars out from under the thing we know works, even if we build a new version" [21:35:46] AMEN [21:36:00] ottomata: cause a new version might bring new dependencies [21:36:09] waiting for current camus run to finish...
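On the symlink point: a quick way to see which concrete build the unversioned "latest" jar currently resolves to (a sketch; the exact artifacts path is an assumption based on the paths mentioned around here):

    # where does the unversioned refinery-camus jar actually point?
    readlink -f /srv/deployment/analytics/refinery/artifacts/refinery-camus.jar
    ls -l /srv/deployment/analytics/refinery/artifacts/ | grep refinery-camus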
[21:36:27] nuria: the cron job in puppet is using the symlink to latest jar [21:36:29] tsk tsk on me [21:36:57] ottomata: the thing is that lates should not be build if there are no changes [21:37:07] indeed [21:37:12] ottomata: i do not think the symlink is the issue [21:37:14] so ja, this *shouldn't* break [21:37:15] but even so [21:37:18] ottomata: you would expect that [21:37:20] no its not the symlink [21:37:31] but, since the cronjob launches using the symlinked jar [21:37:33] if we deploy [21:37:40] it automatically will launch the next job with the latest jar [21:37:46] instead of the one it has always been working with [21:37:48] ottomata: no schema in /mnt/hdfs/wmf/refinery/current/artifacts/refinery-camus.jar [21:37:54] ! [21:37:58] that'll do it! [21:38:01] something busted with build process then [21:38:05] maybe a build problem, ebernhardson added a submodule [21:38:25] dcausse: i bet git init wasn't run, that is probably it [21:38:32] probably [21:38:37] i think that's it [21:38:38] i built [21:38:40] didn't have it cloned [21:38:41] bah! [21:38:53] so many moving pieces ... [21:38:56] we should make something in the build fail if the repo doesn't exist [21:39:01] yes :/ [21:39:05] it turns out there is a reason some people hate submodules :( [21:39:14] ottomata: YES YES [21:39:45] but the camus error message is pretty confusing :/ [21:40:12] indeed [21:41:05] hmm dcausse we should also probably change kafka.max.pull.minutes.per.task=55 [21:41:08] to 10 or something [21:41:17] do you read mine? [21:41:32] ? [21:41:37] or mediawiki.properties from prod? [21:41:49] mediawiki.properties [21:42:02] in puppet [21:42:05] ah ok [21:42:23] i'm scared to kill the yarn job, even though it should be fine...i'm just gonna wait [21:42:26] ottomata: I don't really what it is so feel free to change it [21:42:31] but i have to wait 55 minutes :) [21:42:34] :) [21:42:39] ah, its the length of time camus is allowed to run [21:42:42] if it reaches the end [21:42:50] of time [21:42:53] it'll stop reading from kafka and just write what it has [21:42:55] to hdfs [21:42:59] ok [21:43:12] the cron is set to run every 15 mins [21:43:16] so we should make it less than that [21:43:19] 13 or 14 minutes maybe [21:43:26] makes sense [21:43:34] that way, when there are problems, we don't have to wait the full amount of time before we can make a change [21:43:36] well [21:43:39] full 55 minutes [21:44:48] dcausse: , nuria, annoying thing about hardcoding libjars refinery-camus version [21:44:52] is that a schema change will now require a p uppet change [21:45:01] so the cron uses the new jar with the schema change [21:45:15] schema in the jar is not really ideal :( [21:45:18] but, i guess that is better than errors just cropping up when we aren't looking [21:45:20] yeah indeed [21:47:20] nuria: I don't see anything I'm logging being logged there [21:47:24] not sure what i'm missing [21:49:36] madhuvishy: in m4 log? [21:49:50] madhuvishy: ahh , did you build? like : [21:50:32] ~>/srv/deployment/eventlogging/eventlogging# python setup.py install [21:52:50] and after restart [21:54:34] nuria: gah - i've been changing in my home folder - i did do setup.py install but i know why [21:54:44] madhuvishy: k [21:54:54] madhuvishy: in beta I normally change the main thing [21:55:03] the main ..ahem .. checkout [22:03:51] nuria: yeah - i put some logging.error lines in jrm.py [22:04:08] nothing on logs though [22:05:01] madhuvishy: man ..... 
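And a sketch of the check that would have caught the schema-less deployed jar (the "no schema in refinery-camus.jar" discovery at 21:37): confirm the avro schemas made it into the artifact and that the schema submodule is populated before building; the .avsc packaging detail is inferred from the schema file referenced at 21:17:

    # list avro schema resources bundled in the deployed refinery-camus jar
    unzip -l /mnt/hdfs/wmf/refinery/current/artifacts/refinery-camus.jar | grep -i '\.avsc'
    # before building, make sure the schema submodule is actually checked out
    git submodule update --init --recursive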
[22:05:14] madhuvishy: gotta love the mysql side of eventlogging [22:05:23] nuria: dont know why nothing on logs - may be i'm looking at the wrong place [22:05:29] madhuvishy: where is your code? [22:05:56] nuria: /srv/deployment/eventlogging/ [22:06:46] nuria: /srv/deployment/eventlogging/eventlogging/eventlogging/jrm.py [22:06:54] madhuvishy: k, changing [22:06:59] may be i can add some to the consumer [22:08:15] nuria: i think i know why [22:08:31] madhuvishy: I was going to suggest "print" [22:08:34] rather than logger [22:08:52] nuria: it should be logger.error not logging.error [22:09:02] madhuvishy: k, i will let you change [22:12:48] dcausse: ok, starting short camus mediawiki run with 0.0.23 [22:13:45] oh, i lied before camus mediawiki runs every hour [22:13:50] hm, maybe we should shorten it [22:13:54] in another change :/ [22:13:55] meh [22:13:55] :) [22:17:51] ottomata: ok thanks :) [22:19:05] looks like it ran [22:19:08] hm [22:19:18] ottomata: if i put log statements in eventlogging/jrm.py where would it show up? [22:19:41] ottomata: no new data, strange :/ [22:19:43] how are you running eventlogging? [22:19:54] dcausse: i only ran it for 5 minutes [22:19:55] but yeah [22:19:55] hm [22:20:21] ottomata: in beta cluster - just changing inside /srv/deployment/eventlogging - setup.py install - and restarting [22:20:28] oh i see some errors [22:20:35] madhuvishy: shoudl be in /var/log/upstart then [22:20:53] oh those are fine errors [22:20:57] just topci not fully pulled [22:21:00] in eventlogging_consumer-mysql-m4-master.log? [22:21:08] 16/02/09 22:17:42 INFO kafka.CamusJob: HDFS: Number of bytes written: 15857 [22:21:10] probably madhuvishy [22:21:16] hmmmm [22:23:38] dcausse: i have to run :( [22:23:47] oooOOOO [22:23:48] hm [22:23:55] i will bring my computer with me and check in on this later [22:24:33] ottomata: no problem, going to sleep, send me a mail if you want me to do something tomorrow morning [22:24:58] ok [22:28:57] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013086 (mobrovac) Status: @Milimetric created a new source repository directly in Gerrit - [analytics/query-service](https://gerrit.wikimedia.org/r/#/admin/proj... [22:29:15] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013088 (mobrovac) [22:43:34] (PS2) Mobrovac: Removed RESTBase-specific code from AQS repository. [analytics/query-service] - https://gerrit.wikimedia.org/r/269477 (https://phabricator.wikimedia.org/T126294) (owner: Ppchelko) [22:44:16] (CR) Mobrovac: [C: 2 V: 2] Removed RESTBase-specific code from AQS repository. 
[analytics/query-service] - https://gerrit.wikimedia.org/r/269477 (https://phabricator.wikimedia.org/T126294) (owner: Ppchelko) [22:44:34] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013138 (mobrovac) [23:18:12] nuria: something is weirdddd [23:27:14] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013325 (mobrovac) [23:27:16] Analytics, Beta-Cluster-Infrastructure, Deployment-Systems, Services, and 3 others: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#2013324 (mobrovac) [23:28:15] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013332 (Pchelolo) [23:33:51] Analytics-EventLogging, scap, Scap3: Move EventLogging service to scap3 - https://phabricator.wikimedia.org/T118772#2013366 (greg)