[09:58:19] !log Restart cassandra on aqs1001 [09:58:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [10:15:50] Analytics-Tech-community-metrics, DevRel-October-2015, Patch-For-Review: Fine tune "Code Review overview" metrics page in Korma - https://phabricator.wikimedia.org/T97118#1752557 (Aklapper) Open>Resolved This is deployed now and live on korma. Closing as resolved. [10:16:58] Analytics-Tech-community-metrics, DevRel-October-2015, Patch-For-Review: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1752559 (Aklapper) [10:19:22] Analytics-Tech-community-metrics: Add remaining KPIs to kpi_overview.html once available in korma - https://phabricator.wikimedia.org/T116572#1752560 (Aklapper) NEW [10:20:02] Analytics-Tech-community-metrics, DevRel-October-2015: Tech community KPIs for the WMF metrics meeting - https://phabricator.wikimedia.org/T107562#1752573 (Aklapper) [10:20:04] Analytics-Tech-community-metrics, DevRel-October-2015, Patch-For-Review: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1752569 (Aklapper) Open>Resolved Initial version is available now on http://korma.wmflabs.org/browser/kpi_overview.html... [10:21:18] hey joal, you around? [10:21:30] https://druid.wmflabs.org/pivot [10:21:44] it doesn't seem to handle the huge article dimension very well [10:21:52] but you can get top level stats fairly well [10:22:43] hey milimetric [10:22:45] (you can also query it for top articles, etc. but it seems to take like 1 minute on the first query) [10:22:52] Have you slept a bit ? [10:23:05] :) [10:23:07] nah, plenty of time to sleep when we're dead :) [10:23:19] riiiiiight, I know the thing ;) [10:23:28] That's cool ! [10:23:38] So what about data size ? [10:24:37] Also milimetric, restarted cassandra this morning, a lot of loading have failed this weekend, restarted everything [10:24:57] :( [10:25:02] yes :( [10:25:11] yeah, we can take a look at the instance on labs together if you want [10:25:20] but it looks to me like data size is about 15 GB [10:25:22] for a day [10:25:24] hourlyy [10:25:38] which is pretty sweet, even 3x replicated it would only be 45GB [10:26:43] I have double checked as well the data we played with at mexico: there have been scope move for pageview per article [10:26:48] indeed, that's cool ! [10:28:04] yeah, i don't think the scope we tried to work on in mexico was set in stone or anything [10:28:29] the additional dimensions if I recall correctly were from people asking for them on the api thread on phab [10:28:34] and I think we said no to a lot of others [10:29:28] milimetric: yeah, I still fill bad about not having double checked on the size :( [10:29:33] anyway, moving forward :) [10:30:49] well, i think this is just the detail of what we expected though. 
[10:30:53] so i mean we were prepared for this [10:31:06] we said as we add dimensions cassandra's not going to scale [10:31:14] it just happened right away instead of in a few months [10:31:26] so we remove hourly (totally fine for a Looot of people) [10:31:33] true [10:31:33] like, dude, people are Stoked right now [10:31:47] :) [10:31:53] including me, obviously, I've been high all weekend :) [10:31:58] hehehe [10:32:06] I'm sure it's bacause of the druid magics ;) [10:32:17] plants and all ;-P [10:32:23] nah, that's been annoying (not because of druid, but labs storage limitations) [10:32:26] really fun stories to tell [10:32:32] heh [10:32:56] but now, that we know we have to look at other tools, we get to play with cool stuff [10:32:59] so that's fun for us too [10:33:09] for sure it :) [10:33:13] i think this is a win all around. Except for those poor people who wanted hourly data yesterday [10:33:19] right [10:33:35] So, about druid, happy so far ? [10:33:53] well the data just loaded [10:34:10] labs is obviously not the place to try a local map reduce on 35GB of data [10:34:23] :D [10:34:32] because it generated about 70GB of temp files before it compacted them down to 15 [10:34:52] so I had to like watch it and move some of its temp files to the NFS drive and replace them with symlinks [10:34:53] lol [10:34:56] so nuts [10:35:12] wow [10:35:14] but it kept going [10:35:17] cool [10:35:26] took i think something like 40 hours [10:35:37] I have easy chart of hourly pv for user en.wikipedia :) [10:35:48] oh on pivot? [10:35:51] yessir [10:35:56] cool, yeah, it's decent at the higher level stuff [10:36:05] but if you try anything with the article dimension in pivot it dies [10:36:09] makes sense, that's a huge dimension [10:36:14] yes [10:36:34] Have you tried to query at article level without piovot ? [10:37:40] I was just trying to make a query for that [10:37:50] awesome [10:37:59] but I remembered I saw "filer" in the pivot ui, gonna try that for a sec [10:38:16] did try that --> not working [10:38:54] yeah, times out :) [10:39:03] it's ok, they say themselves this is very very early - alpha [10:39:18] ok [10:39:59] About druid install / config, I'd love to know more and spend some time with you at some point :) [10:40:17] oooh, fast!!! [10:40:27] 1.5 seconds hourly pageviews for Obama with agent user [10:40:35] not bad ! [10:40:37] (first query, not cached) [10:40:46] Can you send me an example query ? [10:40:50] yes, one sec [10:42:16] ok, joal, so put this in a file called pageviews-timeseries.json: [10:42:19] https://www.irccloud.com/pastebin/xAaNKUVX/ [10:42:27] and then run this: [10:42:35] time curl -L -H'Content-Type: application/json' -XPOST --data-binary @pageviews-timeseries.json http://druid.wmflabs.org/druid/v2/ [10:43:16] with more details here: http://druid.io/docs/latest/querying/timeseriesquery.html [10:45:15] fast indeed milimetric :) [10:45:27] tried a few, and seems to work :) [10:46:28] less than 2 secs for daily aggregation :) [10:46:31] nice :) [10:46:32] yeah, pretty fast without the "and" filter [10:46:52] like if you just pull up the "type": "selector" part directly under "filter" [10:47:05] that it seems to cache [10:47:10] whereas the fancier filters it doesn't [10:47:31] but even not cached the queries run fast. 
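(The irccloud pastebin above isn't preserved in this log; below is a rough sketch of what a Druid timeseries query along the lines of pageviews-timeseries.json could have looked like, based on the "Obama with agent user" example and the "and"/"selector" filters discussed here — the datasource, dimension and metric names are assumptions, not the actual file.)

```json
{
  "queryType": "timeseries",
  "dataSource": "pageviews-hourly",
  "granularity": "hour",
  "intervals": ["2015-10-01T00:00:00.000/2015-10-15T00:00:00.000"],
  "filter": {
    "type": "and",
    "fields": [
      {"type": "selector", "dimension": "article", "value": "Barack_Obama"},
      {"type": "selector", "dimension": "agent_type", "value": "user"}
    ]
  },
  "aggregations": [
    {"type": "longSum", "name": "views", "fieldName": "view_count"}
  ]
}
```

Posted with the curl command above, a timeseries query like this returns one aggregated row per hour of the requested interval.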
And this is just labs also, of course [10:47:58] honestly, my first sense is that if we piped this through RESTBase it would be fine [10:48:07] ok milimetric, that sounds like a good idea :) [10:48:33] worst case we can rate-limit it on the RESTBase side if we find there's a way to kill it [10:48:35] So that means: 1 - puppet for druid [10:48:42] 2 - loading for druid [10:48:49] well, let's get the rest of the team's opinion of course [10:48:49] 3 - restbase interface for druid [10:48:54] of course :) [10:48:57] loading is really trivial [10:49:11] what do you need ? [10:49:13] it's like a trivial query to pageview_hourly, and no more cubes or anything needed [10:49:16] and a json task [10:49:20] hang on, I'll paste that [10:49:23] sure [10:49:40] k, this is what worked: [10:49:43] https://www.irccloud.com/pastebin/WBp7vZm5/ [10:50:19] notice the "ignoreInvalidRows" which we do have to deal with (special characters because we "faked" json output from Hive with string concat) [10:50:22] you loaded using haddop ? [10:50:31] yes, technically [10:50:31] makes sense milimetric [10:50:39] but it just runs hadoop in local mode on the labs instance [10:50:46] so we ran this query on Hive: [10:50:52] so you had to install hadoop locally on labs ? [10:51:04] https://www.irccloud.com/pastebin/U1mvFCpG/ [10:51:16] no, joal it came with all the jars it needs for this install [10:51:20] yup, makes sense [10:51:24] ok, great [10:51:28] all i did was curl down the tar and extract [10:51:42] and make sure java was installed [10:51:55] and you have a single server started, have you ? [10:52:09] so after running that we zipped and moved the file to labs and posted that JSON task to druid the same way you post the queries (same exact) [10:52:19] I made a systemd service for druid [10:52:36] so if you login to druid1.eqiad.wmflabs you can manage it with systemctl * druid [10:52:53] fantastic :) [10:52:54] and I made an nginx proxy for pivot, just listens on 80 and forwards to 9095 [10:53:24] i mean, actually, if we remove this data from it people could use this instance on labs like a mini labs-only druid to play with [10:53:30] i think it doesn't really need anything else [10:53:38] you don't even have to login to do most things [10:53:48] that's great :) [10:54:03] I have the pivot source locally too so I can hack it to do more graph types if I want :) [10:54:16] anyway, i'm probably gonna go back to sleep now [10:54:38] huhuhu [10:54:45] You really spent the night on it :) [10:55:01] Well, have a good sleep dude, that's awesome work you did here [10:55:30] aw, nobody viewed my user page on october 14th :( [10:55:31] lol [10:55:33] And you were since the beginning: Druid was the option, not cassandra :) [10:55:42] :D [10:56:01] oh, psh, I just like those guys and respect what they've done, I am no fortune-teller [10:56:09] it was a biased opinion not an intelligent guess [10:56:40] yes, but still, guts feeling is important, even in technical scopes :) [10:56:43] I respect that [10:57:00] Let's talk with the team on how me can make the move [10:57:23] hah! User:Eloquence has a few pageviews every hour, up to 10 [10:57:44] man, you gotta be really famous to get even a couple pageviews here and there [10:57:46] maybe test some bigger druid setting with some spares [10:57:49] if we can :) [10:58:16] yes, that'd be fun [10:58:22] and a real Hadoop instead of a fake local one [10:58:28] see how fast the load is then [10:58:34] yes ! [10:58:36] because... 
40 hours for 24 hours of data's not gonna work [10:58:37] :D [10:59:09] hm, making sure we don't kill druid as well: 30 haddop machines for 3 druid ones, even if druid is strong, it might be too much :D [10:59:54] anyway, go to sleep mate :) [11:00:05] I'll see you at standup :) [11:01:22] yes [11:01:31] Hillary Clinton's getting more views than Barack Obama [11:01:32] cool :) [11:01:39] INTERESTING [11:01:45] I love datas [11:02:14] :D [11:03:06] I don't agree with your analysis on data not taking more space: since it's by-segment, it won't compress much (depending on the segment grqanularity we choose) [11:04:41] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out [11:04:51] hmmm, also milimetric, making the computation about one year of data, taking 15Gb daily, that still makes 5.5T a year [11:05:26] we need to discuss/review al that :) [11:06:21] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.006 second response time on port 9042 [11:09:49] Oh, true, right, so within each segment it'll be small. [11:09:59] yes milimetric, I think so [11:10:16] Format is kinda like parquet - columnar, plus indexed on values [11:10:17] we won't get compression, I think that makes sense, druid is basically compressing here [11:10:24] yessit [11:10:25] yep [11:10:41] hey, shouldn't you be in bed ? [11:10:45] :-P [11:10:53] I am, ipad chatting now [11:11:20] :) [12:19:35] Analytics-Backlog: Fix layout of the daily email that sends pageview dataset status - https://phabricator.wikimedia.org/T116578#1752766 (mforns) NEW [12:33:13] joal, yt? [12:44:19] mforns: we should talk more when everyone's around [12:44:27] but I was thinking we can still do hourly [12:44:32] milimetric, hi! about what? [12:44:34] because druid can store in hdfs [12:44:38] hey mforns [12:44:47] ah! ok [12:44:50] oh, i'll let you talk to joseph, we can all hangout after if you wanna chat :) [12:44:51] hey joal [12:45:02] no no, that's ok [12:45:30] I was going to ask joal if I could help him in something, but now I started the EL backfill, so I already have a task [12:45:45] ok mforns :) [12:54:35] * joal is gone for lunch [12:55:10] bon appétit [13:59:47] joal: milimetric: aqs seems to be acting up, possible cassandra issues [14:02:39] gr [14:03:03] mobrovac: how'd you notice, and is this normal [14:03:19] no, it's not normal milimetric [14:03:24] reported in #ops [14:04:26] milimetric: joal: have you stopped cassandra on aqs1001 a long time ago? [14:05:05] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1752974 (Ottomata) I'm still a little confused about how this reqid/id will work? You are suggesting that it comes from the x-r... [14:05:27] mobrovac: joal is doing some loading, I know it failed to insert over the weekend and he restarted stuff this morning [14:05:33] but i didn't hear about anything being down [14:05:56] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1752975 (Ottomata) To avoid possible conflicts, I'd suggest we call this not just `id`. How about `uuid`? That's what EventLog... 
[14:06:01] cass aqs1001 is marked as down [14:07:22] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1752979 (Ottomata) Also, this is just a personal preference, but I'd prefer if we had a convention differentiating integer/secon... [14:09:36] joal / mobrovac: my guess is the 1.3TB of data is the problem [14:10:00] so let's truncate the old pageviews-per-article and re-load with just daily data? [14:10:03] maybe that'll help? [14:11:44] restarted cass on aqs1001, seems good now [14:16:40] mobrovac: did you truncated the table? [14:16:58] nope, just restarted [14:17:34] each of the nodes has between 800 and 900 GB of data [14:19:08] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753027 (Ottomata) Also, over at [[ https://phabricator.wikimedia.org/T88459#1694274 | T88459#1694274 ]], I commented: If we a... [14:19:50] mobrovac: can you please dump the table? [14:20:05] mobrovac: it is pretty worrisome that aqs goes down and we do not know why [14:20:21] mobrovac: let's rule out space issues by freeing space we donot need [14:20:25] dump the table nuria? [14:21:22] milimetric / joal / ottomata should do that so they learn it :) [14:21:59] soo [14:22:07] but since at this point it is completely unclear to me why that one instance went south just randomly deleting data doesn't sound like a good idea [14:22:08] a whole bunch of EventLogging tables simultaneously stopped picking up new events [14:22:12] mobrovac / nuria: we need to be ready to load the daily before we truncate [14:22:15] let's wait for joal [14:22:21] I'm going to shout at the individual teams but before I do that, just to check, EL isn't borked or anything, right? [14:22:35] milimetric: ok, mobrovac do you have a guess why cassandra is down? [14:22:40] Ironholds: i'll check [14:22:49] milimetric: i can check that [14:22:56] nuria: it's not down, it's up and running [14:22:56] milimetric: since it is my EL week [14:23:08] mobrovac: it was don this morning I thought, right? [14:23:16] mobrovac: at least that is what the alarms said [14:23:33] milimetric, danke! Both mobile web and mobile apps search tracking just turned off on the 22nd. I'm wondering what happened. [14:23:35] yes, only on aqs1001, but no idea why [14:24:00] milimetric said joal was doing something there so i guess we should for him to tell us what exactly [14:24:12] Ironholds: hm... I see a big spike, but no obvious problems afterwards [14:24:36] milimetric, okie, I'll assume they turned it off and scold them. Thanks! [14:24:52] Ironholds: one sec, can you give me a full schema name? [14:25:24] Ironholds: nuria will help you, it's her week on-duty (we take turns every <> weeks) [14:25:31] awesome! [14:25:41] https://meta.wikimedia.org/wiki/Schema:MobileWikiAppSearch and https://meta.wikimedia.org/wiki/Schema:MobileWebSearch [14:25:48] mobrovac: ahem...it will be best if we could infer why from ops data, like load, memory.. etc [14:26:59] Ironholds, milimetric, I'm backfilling EL right now [14:27:14] mforns: thank you, i will take a look at incoming data [14:27:18] it may be the reason of the delay for new events [14:27:25] mforns, delayed since the 22nd? [14:27:31] no no [14:27:38] I started a couple hours ago [14:27:50] then it sounds like a different problem, but thank you! 
[14:28:02] the data stops, for both tables, at different points on 20151022 [14:28:08] Analytics-Backlog, Developer-Relations, MediaWiki-API, Reading-Admin, and 5 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1753045 (Nuria) >Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug? We can... [14:28:22] milimetric, nuria it may be the change to batch size = 5000? It shouldn't... [14:28:29] mforns: oh yeah, that'll be the spikes [14:28:41] mforns: mmm, those schemas get events much faster than that [14:29:06] but even if they don't, they'd insert every 300 seconds due to the time limit [14:29:12] hey [14:29:31] thanks mobrovac for having restarted cass on aqs1001 [14:29:39] milimetric, yes! totally [14:29:43] I had already done that a couple time this morning [14:29:48] Ironholds: do you know that ebernhardson was doing changes to schema as of FRiday [14:30:02] Ironholds: the latest update had a validation issue [14:30:29] I'll truncate the old data --> milimetric, agreed ? [14:30:46] joal: how long would it take to load the old data daily? [14:31:08] milimetric: a few days [14:31:22] long to compute, and awfully long to load [14:31:27] Ironholds: they had to revert this change for things to work: https://gerrit.wikimedia.org/r/#/c/247853/ [14:31:29] really? [14:31:30] huh [14:31:39] ok, joal, i'd say we have no choice [14:31:48] we have to cut off the leg to save the body [14:31:53] truncate away [14:31:55] kinda :( [14:31:58] ok [14:32:14] joal: then don't worry about backfilling old data [14:32:20] let's finish loading the new schema daily [14:32:24] ok milimetric [14:32:26] since that was in progress already [14:32:35] Since aqs1001 was having trouble it actually failed loading ... [14:32:36] and we'll just switch the handler now instead of later [14:32:40] Will need to reatrt away [14:32:49] that's ok, but no sense loading the old data at this point [14:33:03] agreed [14:33:08] the new schema should load a little bit faster too [14:33:19] tiny silver lining!! They're always there if you look :) [14:33:29] !log truncating "local_group_default_T_pageviews_per_article".data on aqs [14:33:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [14:33:38] nuria, yes, I do [14:33:43] these aren't Discovery-maintained schemas [14:33:56] and those changes aren't to these tables [14:34:29] milimetric: my silverlining for today is the druid demo :) [14:34:42] :) [14:34:48] I'll email the web and apps teams [14:35:22] hm, nuria but I don't see those schemas in the errors dashboard: https://logstash.wikimedia.org/#/dashboard/elasticsearch/eventlogging-errors [14:35:54] the maps schema is /also/ broken [14:36:06] all these schemas were set up ahead of the revert to the 13th [14:36:21] that is, existed before the 13th [14:37:15] joal: when doing stuff like restarting (rb or cass) or something like that, you should also !log it in the #ops channel [14:37:21] Ironholds: hang on for just a sec [14:37:27] kk [14:37:30] I see events for those schemas in logs and the master db [14:37:33] checking replica [14:37:42] cc nuria ^ [14:38:12] milimetric: sorry, my internet went down, redaing [14:38:17] *reading [14:38:34] mobrovac: is cassandra supposed to be tier-1? [14:38:34] Ironholds / nuria: this is a master -> analytics-store replication problem [14:38:49] it started when jynus was doing his work, I think, October 22nd [14:38:59] milimetric: how do you look at master db? 
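(On the Cassandra/AQS thread above: the !log entry at 14:33 corresponds to a plain CQL truncation. A minimal sketch of what that looks like, assuming cqlsh against one of the aqs nodes — TRUNCATE applies cluster-wide, so running it on a single node is enough:)

```
cqlsh aqs1001.eqiad.wmnet 9042
cqlsh> TRUNCATE "local_group_default_T_pageviews_per_article".data;
```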
[14:39:01] Ironholds: master has all the events though [14:39:05] milimetric: ah from EL box? [14:39:07] nuria: eventlog1001 [14:39:08] yea [14:39:24] milimetric, awesome. Glad we haven't missed anything, but manually backfilling is a Pain so it'd be good to get the replag fixed ASAP. [14:39:30] milimetric: ya, i was trying to load tendril [14:39:42] Ironholds: replag is due to db maintenance [14:39:47] Ironholds: no backfilling needed [14:39:55] Ironholds: ah you mean in your end [14:39:58] yep [14:40:10] Ironholds: that is dbs work [14:40:13] I wouldn't claim to know how complex db backfilling is on account of not being a DBA ;p [14:40:27] so, who should I be bothering? [14:40:38] Ironholds: we're bothering jynus, no worries [14:40:51] Ironholds: replag here is 36 min [14:40:53] https://tendril.wikimedia.org/host/view/db1046.eqiad.wmnet/3306 [14:41:02] so i think something else is going on [14:42:10] nuria: nah, max(timestamp) for MobileWikiAppSearch_10641988 is now on m4 and October 22nd on analytics-store [14:42:23] replag must be reporting something else lower level [14:42:31] ok mobrovac [14:44:30] joal: milimetric: here's a cass dashboard for AQS - https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-system [14:45:14] oh damn, that looks like a horror movie [14:45:16] :) [14:45:25] Thanks mobrovac ! [14:47:45] we have a loot of these dashboards for cassandra [14:48:03] jsust look for dashboards named restbase :: cassandra in grafana [14:48:14] but these are for our rb cluster [14:48:37] you should be able to export/import them and then tweak the names by replacing restbase with aqs [15:02:14] Ironholds: how do you connect to db on 1002? [15:02:19] Ironholds: this: mysql --defaults-extra-file=/etc/mysql/conf.d/research-client.cnf -hanalytics-store.eqiad.wmnet [15:02:29] Ironholds: is giving me an error, did that chnaged? [15:02:32] *changed? [15:03:11] I don't know, I've set it as my .cnf so I just use -h analytics-store.eqiad.wmnet [15:03:45] Ironholds: waht do you have in your .cnf? [15:04:01] Analytics: AQS cluster Grafana dashboards - https://phabricator.wikimedia.org/T116590#1753105 (Eevans) NEW [15:05:33] nuria, try mysql --defaults-extra-file=/etc/mysql/conf.d/analytics-research-client.cnf -h analytics-store.eqiad.wmnet [15:05:44] Ironholds: ya, just saw that [15:15:48] Ironholds: you can follow up conversation with dba on #wikimedia-databases [15:16:44] hello! would someone here be playing with schema.Search? i'm getting a mediawiki internal error on test.wikipedia.org [15:17:12] niedzielski, it's being modified, yes. Check in #wikimedia-discovery ? [15:17:22] Ironholds: ok cool. thanks! [15:17:48] np! [15:22:29] Analytics-EventLogging: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753312 (greg) a:EBernhardson [15:23:03] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753319 (mobrovac) >>! In T116247#1752974, @Ottomata wrote: > I'm still a little confused about how this reqid/id will work? Yo... 
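(A hedged sketch of the check milimetric describes above — comparing max(timestamp) for the same table on the EventLogging master and on analytics-store; the `log` database name is the usual EventLogging convention and is an assumption here:)

```sql
-- Run once against the m4 master (via eventlog1001) and once against
-- analytics-store.eqiad.wmnet; a gap of days between the two results points
-- at master -> slave replication, not at EventLogging itself.
SELECT MAX(timestamp) FROM log.MobileWikiAppSearch_10641988;
```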
[15:23:28] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753324 (mobrovac) [15:33:27] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753398 (Ottomata) > I don't see a conflicting problem with id (even though id is a JSONSchema keyword, but it relates to the sc... [15:33:58] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753399 (Eevans) >>! In T116247#1749452, @Ottomata wrote: > Right, but how would you do this in say, Hive? Or in bash? In bash... [15:34:43] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753400 (Ottomata) > Manual schema versions. We could increase the schema version every time we change something in the schema.... [15:37:37] Analytics-Backlog, Wikimedia-Mailing-lists, operations: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1753431 (Nuria) In order to get this requests in hadoop this domain needs to be fronted by varnish, by looking through pu... [15:39:13] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1753433 (Nuria) NEW a:jcrespo [15:41:03] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1753443 (Nuria) Since 201522 there is an issue with Eventlogging SQL data. Data is not replicating from master to slave for some tables. For example for table:... [15:41:55] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1753444 (Nuria) There is something beyond replication lag that has to explain this issue, as lag, at the time of writing of this ticket is of minutes, not days. 
[15:42:19] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1753447 (Nuria) [15:43:52] Analytics-Kanban, Database: Data missing from dbstore1002 but present in dbstore2002 - https://phabricator.wikimedia.org/T116600#1753454 (Milimetric) NEW a:jcrespo [15:44:24] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1753471 (jcrespo) [15:45:25] Analytics-Kanban, Database: Data missing from dbstore1002 but present in dbstore2002 - https://phabricator.wikimedia.org/T116600#1753477 (jcrespo) [15:46:23] (PS4) Joal: Update camus-partition-checker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847 [15:50:16] (PS5) Joal: Update camus-partition-checker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847 [15:50:20] Analytics-Cluster, Security-Team: Followup assessment for analytics cluster - https://phabricator.wikimedia.org/T116305#1753492 (csteipp) p:Triage>High [15:54:13] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753513 (mobrovac) >>! In T116247#1753398, @Ottomata wrote: > Ok cool, if that's the case, then `reqid` or even `request_id` (I... [15:55:06] (CR) Ottomata: [C: 1] Update camus-partition-checker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847 (owner: Joal) [15:59:52] mforns_lunch: standup? [16:00:10] nuria, omw [16:00:13] milimetric, ottomata madhuvishy standuppp? [16:00:34] AH [16:01:40] kevinator: standupppp [16:02:44] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753579 (Ottomata) > Hm, I think duplicates should be detected based on the content of the message itself and the time stamp. Ev... [16:29:20] (CR) Ottomata: [C: 2] Update camus-partition-checker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847 (owner: Joal) [16:29:48] (CR) Ottomata: [V: 2] Update camus-partition-checker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247847 (owner: Joal) [16:35:49] Analytics: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. - https://phabricator.wikimedia.org/T116609#1753757 (ezachte) NEW [16:38:13] Analytics: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. - https://phabricator.wikimedia.org/T116609#1753790 (ezachte) hive query for {F2855318} USE wmf ; SELECT agent_type, project, access_method, day, sum(view_count) AS count FROM pageview_hourly WHER... [16:40:52] Analytics: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. - https://phabricator.wikimedia.org/T116609#1753799 (ezachte) webstatscollector 2.0 output: in stat1002:/mnt/data/xmldatadumps/public/other/pagecounts-raw/2015/2015-07> grep en.n projectcounts-201... [16:41:43] Analytics: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. - https://phabricator.wikimedia.org/T116609#1753802 (Ottomata) When you say 'squid logs' what do you mean? [16:48:01] Analytics: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. 
- https://phabricator.wikimedia.org/T116609#1753830 (ezachte) filtering 1:1000 sampled squid logs for wikinews html requests zcat sampled-1000.tsv.log-20150710.gz | grep 'wikinews' > ~/wikinews_2... [16:49:33] Analytics-Wikistats: Monthly page view stats for wikibooks, wikinews, wikiquote, wikisource, wikiversity for July 2015 are extremely anomalous - https://phabricator.wikimedia.org/T116531#1753835 (ezachte) [16:53:37] joal: ok, back to the cave :) [16:57:08] nuria, I'm back [16:59:39] Analytics: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. - https://phabricator.wikimedia.org/T116609#1753895 (ezachte) @Ottomata squid logs is 1:1000 sampled at stat1002:/a/squid/archive/sampled> [17:01:39] mforns: let me know if you can log into: elastic.analytics.eqiad.wmflabs [17:03:44] nuria, yes, I can log in [17:03:46] :] [17:08:45] ottomata: we'll need you to merge a puppet change as part of this deploy [17:09:03] https://gerrit.wikimedia.org/r/#/c/247458/ [17:09:08] Analytics-Kanban, WMF-deploy-2015-10-27_(1.27.0-wmf.4): Incident: EventLogging mysql consumer stopped consuming from kafka {oryx} - https://phabricator.wikimedia.org/T115667#1753935 (mforns) a:Milimetric>mforns [17:10:39] Analytics-Kanban, WMF-deploy-2015-10-27_(1.27.0-wmf.4): Incident: EventLogging mysql consumer stopped consuming from kafka {oryx} - https://phabricator.wikimedia.org/T115667#1729697 (mforns) Executing the backfilling right now. Seems to be working fine, will take couple hours. [17:12:14] Analytics-EventLogging, Editing-Department, Improving access, QuickSurveys, and 2 others: Schema changes - https://phabricator.wikimedia.org/T114164#1753952 (Jdlrobson) p:Triage>Normal [17:14:48] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1753983 (jcrespo) Question: the max of max is: 20151026170824 Does this mean that all data prior to this is there, or are events loaded out of order, independen... [17:17:19] Analytics: AQS cluster Grafana dashboards - https://phabricator.wikimedia.org/T116590#1753996 (Eevans) a:Eevans [17:19:42] hey mforns [17:19:44] you around ? [17:19:51] hey joal [17:19:52] yes [17:19:56] cool :) [17:20:07] :] [17:20:23] We are dploying refinery with milimetric and having looked at hue, we see a lot of coordinators for your tests still up [17:20:28] mforns: --^ [17:20:36] pppppfffff [17:20:40] D you want us to kill them ? [17:21:08] mmmmmmm, I though I had killed everything, just a sec, are you in da cave? [17:21:18] yup [17:21:18] joal, ^ [17:32:50] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754104 (jcrespo) p:Unbreak!>High [17:37:29] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754115 (Ironholds) All of our dashboards are broken and incapable of being fixed. This is definitely an "unbreak now" from the point of view of our team. [17:37:57] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754120 (jcrespo) Setting it to high because: I do not think this is a new issue: on 29 sept, I made an audit of the eventlogging database and analytics-store (d... 
[17:38:26] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754123 (GWicke) > If we adopt a convention of always storing schema name and/or revision in the schemas themselves, then we can... [17:38:59] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754124 (jcrespo) @Ironholds when is the last time you knew your data was "fresh"? [17:40:08] Analytics: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. - https://phabricator.wikimedia.org/T116609#1754127 (Nuria) @ezachte: As I understand it the sampled logs and pageviews cannot be compared directly, the pageview definition does not count many of t... [17:44:58] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754145 (GWicke) > I'm not so sure actually that these will always be redundant. I think the request ID should be persisted to t... [17:45:06] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754154 (Nuria) >This has been like that likely for months, as the last reference of m4 there I can find is from 2015-08-28. No, this is been broken since the 22n... [18:00:10] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754237 (jcrespo) If it is an unbreak-now, can someone please join me on #wikimedia-databases ? I have questions that need answers. [18:05:04] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754253 (Ironholds) What Nuria said, basically. The idea of "use the codfw box for analytics" ignores the problem that not all services have /access/ to the codfw... [18:11:38] Analytics: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. - https://phabricator.wikimedia.org/T116609#1754283 (ezachte) @Nuria Right, I know actually, so yes 128 (x 1000) for sampled logs (255 - 127 CentralAuthoLogin) comes somewhat close to hive number fr... [18:15:48] Analytics: View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. - https://phabricator.wikimedia.org/T116609#1754298 (ezachte) This text in description is misplaced "From squid logs I get a totally different number yet again (see below)" as both chart above and b... [18:20:58] Quarry, Database: Unpatrolled edits don't work on plwiki - https://phabricator.wikimedia.org/T116631#1754331 (The_Polish) NEW [18:24:21] nuria: do you happen to know how one can raise the heap size limit in beeline? (using the env variable like for hive https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Out_of_Memory_Errors_on_Client does not seem to work) [18:24:44] ... 
madhu and i looked into this on friday for while (http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20151023.txt ) but didn't find a solution [18:25:14] (madhuvishy: ^) [18:27:19] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754366 (mobrovac) a:mobrovac [PR 5](https://github.com/wikimedia/restevent/pull/5) proposes the schema definitions for the b... [18:30:38] Analytics: AQS cluster Grafana dashboards - https://phabricator.wikimedia.org/T116590#1754386 (Eevans) The following dashboards have been setup: * https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-cf-latency-rate-copy * https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-cf-row-column-size * ht... [18:31:14] Quarry, Database: Unpatrolled edits don't work on plwiki - https://phabricator.wikimedia.org/T116631#1754401 (Krenair) What? `select sum(rc_patrolled = 1), sum(rc_patrolled = 0) from plwiki_p.recentchanges;` shows some patrolled and some unpatrolled RCs. [18:31:24] Quarry, Database: Unpatrolled edits don't work on plwiki - https://phabricator.wikimedia.org/T116631#1754402 (Krenair) I very much doubt this is an issue with Quarry itself. [18:31:51] Analytics: AQS cluster Grafana dashboards - https://phabricator.wikimedia.org/T116590#1754406 (Eevans) Open>Resolved [18:35:13] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754434 (chasemp) I added a rule to allow analytics to access mysql at 10.192.32.19/32 which is the long term secondary slave for replication of this data as I un... [18:36:43] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754446 (jcrespo) Chase added network access from stats1003.eqiad.wmnet to dbstore2002.codfw.wmnet in order to be able to be used as a temporary replacement for d... [18:40:38] Analytics-Backlog, Developer-Relations, MediaWiki-API, Reading-Admin, and 5 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1754461 (Anomie) >>! In T102079#1753045, @Nuria wrote: >>Could we add this MediaWiki-API-Error header to the webrequest tabl... [18:40:42] HaeB: sorry, no, beeline was just an experiment we were doing but we are not using it formally [18:41:19] HaeB: madhuvishy is on vacation for couple days, but , as i said, beeline might not work so well quite yet. [18:41:32] milimetric: we forgot to log our actions when deploying :) [18:41:35] Doing now [18:41:44] !log refinery deployed on HDFS [18:41:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [18:42:18] !log refine bundle, pageview_hourly and projectview_hourly coord restarted [18:42:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [18:42:26] oh yea [18:42:31] :) [18:42:39] milimetric: In fact my son is in bed :) [18:42:48] HaeB: madhu is in India for 2 weeks, so it might be a while [18:43:08] joal: lemme check real quick in -databases, make sure nobody needs me [18:43:15] sure [18:43:30] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754464 (Ironholds) And what pass/user do we use to access it? Because the 'research' user is getting precisely nowhere. 
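(On HaeB's beeline heap question above: an untested sketch. Beeline runs in its own JVM, so the HADOOP_HEAPSIZE recipe from the wikitech page may simply not reach it; passing JVM options through HADOOP_CLIENT_OPTS is a plausible thing to try. The JDBC URL is a placeholder, not the actual Hive server address.)

```bash
# assumption: beeline, like other Hadoop client scripts, honours HADOOP_CLIENT_OPTS
export HADOOP_CLIENT_OPTS="-Xmx2048m"
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -n "$USER"
```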
[18:43:44] thanks nuria - FYI on friday, ottomata recommended using beeline to circumvent the problem i had encountered on hive (the unicode issues - and indeed it worked for that) [18:44:51] and i know beeline is still a bit experimental. that said, i understand that in the long run it will be recommended over hive " HaeB: sorry, we haven't experimented with beeline much, i'll play with it more next week and document it" [18:45:45] is ottomata on vacation too? he seems to know beeline well [18:49:07] Analytics-Backlog, Developer-Relations, MediaWiki-API, Reading-Admin, and 5 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1754485 (Nuria) >I see nothing relevant to the API developers (i.e. me) here that would require a new ticket. I think a tick... [18:49:52] HaeB: otto will be back tomorrow, but we're not supporting beeline officially yet or anything, it was just another useful tool we were trying to get comfortable with [18:50:12] joal: ok, done, you deploying right now? [18:50:24] nonono :) [18:50:30] Was backfilling log info ;) [18:50:34] milimetric: --^ [18:50:50] milimetric: currently douvle checking pageview and projectview runs [18:52:51] cool, i'm chewing my tuna [18:53:08] and trying to keep really bad things from happening with our dbs [18:53:34] milimetric: EL's DB in bad shape really ? [18:54:06] joal: "we have no idea how data gets into the research cluster (analytics-store)" [18:54:31] replication is down for months - but data's fresh as of last week for most table and up to the minute for others [18:54:33] hrmf, not really a bad shape, more magical one maybe .... [18:54:48] bad because there was talk of poking a hole through the firewall to master [18:54:52] from the research cluters [18:54:55] *cluster [18:54:58] and I'm like - uh, no [18:55:54] yeah makes sense [18:56:32] I think maybe there is miscommunication or two proposals, teh hole poked was to the secondary slave in codfw [18:56:51] which seems to be setup for this purpose(?) but was just never used I imagine [18:59:17] milimetric: your part of the deploy went ok :) [18:59:46] chasemp: I don't think that's the purpose of that secondary slave [19:00:08] I think that's there as a backup of the master in case it needs to be swapped out to keep up with EL inserting [19:00:22] and probably to manage this replication to dbstore1002 [19:00:30] I mean I don't know really but this is in dallas and very new [19:00:31] joal: yay, is that bad news for the rest of it? [19:00:45] so it can't really be have been part of an eqiad redundancy strategy from teh long term [19:01:08] chasemp: oh! ... 
super confused then, maybe I'm remembering wrong from sean, but I could swear he had multiple machines getting data from the EL consumer [19:01:30] yes maybe but not this one, and not for any firewall work today [19:01:42] jynus was trying to say, hey if this is broke use the secondary slave in dallas [19:01:46] and we have time to troubleshoot [19:01:54] but I think many wires are crossed as it's confusing [19:02:06] 1002 isn't broken [19:02:20] it's just not getting new data for less than 1/3 of the tables [19:02:23] well, it's something :) [19:02:26] which is ok-ish [19:02:30] it's not great [19:02:47] gradient of brokenness aside :) that was the story w/ hole poking [19:02:57] milimetric: Forgot one step in the deploy of my stuff , so need to correct / double check [19:03:02] thx joal [19:03:10] np milimetric :) [19:03:11] thx chasemp, I'll keep listening in -databases [19:03:28] Now, aggregator merge and puppet sync of files to dump [19:03:44] Also milimetric , do you want us to backfill the pageviews files ? [19:03:53] joal: no [19:03:58] there was no promise of that happening [19:04:05] ok cool [19:10:09] Analytics-Backlog, Developer-Relations, MediaWiki-API, Reading-Admin, and 5 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1754554 (Anomie) >>! In T102079#1754485, @Nuria wrote: >>I see nothing relevant to the API developers (i.e. me) here that wo... [19:10:51] milimetric, nuria: ok, then i'll ping andrew to see if he happens to know about the heap size limit, and also look out for madhu's documentation for beeline that she is planning to start this week... [19:11:06] ...aprt from that, it seems we're back to solving the unicode problem in hive [19:11:45] HaeB: sorry to get your hope up but I do not think we will do beeline docs this week, there are some other pressing matters that requirte our attention [19:11:57] *hopes up, sorry [19:12:23] HaeB: we shall let you know when we support beeline and there are docs and such [19:15:36] thanks nuria, i know you have other pressing issues and i appreciate the work on those. i was just repeating what other members of the Analytics team told me earlier ;) [19:16:39] ...if beeline is on the backburner though, we really should solve that unicode issue for hive. russian is not a minor language, and if we can't retrieve pageviews for articles in that lanugae reliably from hive, that's a significant problem [19:16:54] HaeB: Right. What we are working on is here: https://phabricator.wikimedia.org/tag/analytics-kanban/ [19:17:22] do you recommend to submit a task on phabricator? [19:17:41] HaeB: anything outside kanban when we say (any of us) "next week" it's probably more like "at some point in the future" [19:18:16] HaeB: I'm sadly not familiar with the unicode issue. is it when you query on the commandline you can't use russian article titles? [19:18:33] HaeB: submitting task will be best yes, thank you! [19:18:41] milimetric: the value of the page title field is garbled [19:19:13] HaeB: it's garbled in the output? [19:19:48] sorry - i won't make you explain it twice, HaeB, I'll check out the task when you file it [19:20:13] also for french, e.g. 
https://fr.wikipedia.org/wiki/Wikip��dia:Accueil_principal instead of https://fr.wikipedia.org/wiki/Wikipédia:Accueil_principal [19:20:39] ok, will do [19:28:20] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754642 (jcrespo) @Ironholds - I have not created yet the access. If it is active, it will be with the same credentials as you are currently using. However, @mili... [19:34:26] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1754659 (jcrespo) Here is the thing: we need to put dbstore1002 up to date, because it is a phantom slave (scripts in terbium update them incrementally). This should fix: T116599 [19:42:59] Analytics-Kanban: Enforce policy for each schema: Sanitize {tick} [8 pts] - https://phabricator.wikimedia.org/T104877#1754692 (jcrespo) [19:43:00] Analytics-Kanban, Database: Delete obsolete schemas {tick} [5 pts] - https://phabricator.wikimedia.org/T108857#1754690 (jcrespo) Resolved>Open We should do this also on dbstore1002, as there is no replication link between the servers. [19:43:30] Analytics-Kanban, Database: Delete obsolete schemas {tick} [5 pts] - https://phabricator.wikimedia.org/T108857#1754693 (jcrespo) Open>Resolved Done. [19:43:31] Analytics-Kanban: Enforce policy for each schema: Sanitize {tick} [8 pts] - https://phabricator.wikimedia.org/T104877#1430750 (jcrespo) [19:45:25] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754698 (Ottomata) > If we have a use case for emitting two secondary events *to the same topic* that were both triggered by the... [19:47:19] Analytics-Backlog: Coorect oozie send_error_email workflow file name typo - https://phabricator.wikimedia.org/T116649#1754699 (JAllemandou) NEW [19:47:24] Analytics-Backlog, Wikimedia-Mailing-lists, operations: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1754706 (Dzahn) Ok, let's not mix up dumps. and lists. in a single ticket please. They are different and unrelated. I'm... [19:49:25] hey a-team, going for diner, will come back double check jobs are ok in a while, and then to sleep :) [19:49:30] See you all tomorrow [19:50:09] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754709 (Ottomata) What do y'all think about keeping these 'framing' fields in a nested object? I'm not sure if this is a good... [19:51:45] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754715 (jcrespo) I found the error on one of springle's screen sessions: ``` db1047 GatherClicks_12114785 >= 20151022150117 (rows!)ERROR 1136 (21S01) at line 19... [20:01:52] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754776 (Milimetric) An approximate list of tables that were affected are mentioned in the incident, in case that helps with troubleshooting / correlating the alt... 
[20:18:40] Analytics-Backlog, Analytics-EventLogging: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1754873 (jcrespo) db1047 is also involved, doing the changes there, too. [20:19:25] Analytics-Kanban, Database: Delete obsolete schemas {tick} [5 pts] - https://phabricator.wikimedia.org/T108857#1754874 (jcrespo) Resolved>Open And on db1047 too! [20:19:26] Analytics-Kanban: Enforce policy for each schema: Sanitize {tick} [8 pts] - https://phabricator.wikimedia.org/T104877#1754876 (jcrespo) [20:21:14] Analytics-Kanban: Enforce policy for each schema: Sanitize {tick} [8 pts] - https://phabricator.wikimedia.org/T104877#1430750 (jcrespo) [20:21:15] Analytics-Kanban, Database: Delete obsolete schemas {tick} [5 pts] - https://phabricator.wikimedia.org/T108857#1754877 (jcrespo) Open>Resolved I already did it to db1047. I need to sleep more. [20:22:30] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1754881 (jcrespo) Also has to be applied independently on db1047. [20:29:17] a-team, for you to know in case you look at it: pageview_hourly oozie job fails, but it's a false positive [20:29:44] I need to double check with Andrew why, it says there are issues with SMTP [20:30:38] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1754904 (jcrespo) Which I just did. [20:32:05] hm, weird [20:32:15] indeed milimetric [20:32:26] i'm a little exhausted today, but i need to write puppet to sync the results to dumps [20:32:39] milimetric: can wait tomorrow :) [20:32:46] and finish the restbase flat change [20:32:54] kk [20:33:00] we need to enable puppet anyway [20:33:07] *merge [20:36:40] milimetric: the detailed error of the email failure is bizarre: milimetric@wikimedia.org does not exist ! [20:36:58] uh...... [20:37:01] that's true - it doesn't [20:37:06] but .... why is that ..... [20:37:30] (CR) Jdlrobson: [C: -1] "Florian I actually think a better test for back to top would be to compare sessions with a click to back to the top. I dont think click tr" [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/247045 (https://phabricator.wikimedia.org/T98701) (owner: Florianschmidtwelzow) [20:37:31] ah: https://github.com/wikimedia/analytics-refinery/search?utf8=%E2%9C%93&q=milimetric [20:37:32] heh [20:37:44] Maaaan, what a shame :) [20:37:48] Ok I think I get it [20:37:52] So ... [20:37:52] joal: https://github.com/wikimedia/analytics-refinery/search?utf8=%E2%9C%93&q=milimetric [20:38:13] and milimetric@wikimedia doesn't exist [20:38:17] ye [20:38:18] s [20:38:19] rooo sh*t [20:38:22] heh [20:38:31] joal: let's make it! [20:38:33] I'd love that alias [20:38:38] I'll write OIT a ticket [20:38:46] :) [20:38:48] Awesome [20:39:06] It's not the way I'd have thought for bug correction :) [20:39:14] But it hsould ! [20:39:16] do [20:39:17] dude, akham's razor [20:39:19] I cut knots like a pro [20:39:27] :D [20:39:48] Also found a previously introduced but [20:39:50] bug [20:40:39] cool, ticket filed, so that should be done soon I think, I mentioned it was a bit urgent [20:42:27] ok, another bug found ... [20:44:53] (PS1) Jdlrobson: Bump MobileOptions revision [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/248990 (https://phabricator.wikimedia.org/T115129) [20:45:32] milimetric: i see we're still using limn any plans to move away from this anytime soon? 
:-/ [20:48:09] hm, milimetric, I'm gonna correct that now, would you mind reviewing ? [20:49:49] And by the way milimetric, I can change the email from the code (doing some changes now) [20:49:58] Let me know what you think is best [20:50:08] jdlrobson: we changed the underlying report updater, that's *much* better now. So the limn part is just the dashboard json pointing to the files and the reports are just the SQL, I didn't think there was a problem with it. I'm happy to change it if there are problems though... [20:50:21] joal: i can review, sure [20:50:49] joal: yeah, probably best to change the email since that alias isn't used anywhere [20:50:58] good to have it though, just to be safe [20:51:14] milimetric: maybe it's just become too easy :) - i was trying to work out how to do https://gerrit.wikimedia.org/r/248990 [20:51:20] but maybe what i've done is all i need to do [20:52:34] Analytics-Kanban: Correct oozie bugs - https://phabricator.wikimedia.org/T116649#1755058 (JAllemandou) a:JAllemandou [20:52:50] Analytics-Kanban: Correct oozie bugs [3 pts] {hawk} - https://phabricator.wikimedia.org/T116649#1754699 (JAllemandou) [20:53:01] oh is this the whole awful schema changing thing? nothing's really going to help with that except fancy db work. limn really has nothing to do with that part [20:53:13] jdlrobson: ^. And looks good to me, want me to merge? [20:53:38] the schema change is all you need. The new reportupdater will keep your old data and start querying the new data [20:54:04] Analytics-Backlog, Analytics-Kanban, Developer-Relations, MediaWiki-API, Research-and-Data: Add Application errors for Mediawiki API to x-analytics - https://phabricator.wikimedia.org/T116658#1755070 (Nuria) NEW a:bd808 [20:54:14] (PS1) Joal: Correct two bugs in refinery oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/248991 (https://phabricator.wikimedia.org/T116649) [20:54:19] milimetric: --^ [20:54:23] jdlrobson: you should probably change this one too: https://gerrit.wikimedia.org/r/#/c/248990/1/mobile/mobile-options.sql [20:55:25] Analytics-Backlog, Developer-Relations, MediaWiki-API, Reading-Admin, and 5 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#1755087 (Nuria) @spage: added a (sub) task : https://phabricator.wikimedia.org/T116658 to keep things organized. [20:55:33] (CR) Milimetric: [C: 2 V: 2] Correct two bugs in refinery oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/248991 (https://phabricator.wikimedia.org/T116649) (owner: Joal) [20:55:47] Awesome, you rock milimetric :) [20:55:55] I'll deploy that right now [20:56:04] for verifying my name is spelled correctly? 
[20:56:10] * milimetric doubts what standards he's being held to [20:56:11] huhuhu :) [20:56:11] :P [20:56:35] arf, a gt insteag of a eq, and a file rename :) [20:59:38] (CR) Milimetric: [C: -1] "The reportupdater will keep your old report values that are already in the dashboard intact while using the new queries to create whatever" (2 comments) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/248990 (https://phabricator.wikimedia.org/T115129) (owner: Jdlrobson) [21:00:10] milimetric: so i guess my issue with limn is i don't work on it too often so when i do there is always a learning curve to remind myself these things :/ [21:00:57] milimetric: reading your review sounds like i need to remind myself join syntax :) [21:01:21] jdlrobson: one sec, I'll give example there [21:03:35] (CR) Milimetric: "you can remove the union all when events stop flowing into the old schema." (1 comment) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/248990 (https://phabricator.wikimedia.org/T115129) (owner: Jdlrobson) [21:04:22] jdlrobson: and yes, the learning curve in this case is SQL and I'm not loving the fact that people have to write SQL to get information about their data. We're playing with better exploratory UIs. [21:04:52] but no matter what, someone's going to have to configure something somewhere, and SQL is pretty light weight compared to mountains of JSON and nightmares :) [21:06:26] Analytics-Backlog, Analytics-EventLogging, Database: Eventlogging tables not replicating from master to slave - https://phabricator.wikimedia.org/T116599#1755118 (jcrespo) [21:08:36] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1755124 (GWicke) >>! In T116247#1754698, @Ottomata wrote: >> If we have a use case for emitting two secondary events *to the sam... [21:10:02] https://www.irccloud.com/pastebin/sOk4pOF5/ [21:10:10] milimetric: ^ what's wrong with the above? [21:12:29] jdlrobson: that's a fine join by itself, but if you use it as an inner query somewhere else it has to be in a set of parens outside all of it and it needs an alias like the "temp" I gave in the example [21:12:39] *union not join [21:13:41] if you paste the whole query i can get it to work... [21:14:20] https://www.irccloud.com/pastebin/t0QCJiEe/ [21:14:22] ^ milimetric [21:14:43] I get Illegal mix of collations for operation 'UNION' [21:15:19] oh, interesting, the problem is there's no alias, but one sec i'll fiddle [21:17:21] jdlrobson: yeah, just add an alias like this: [21:17:40] https://www.irccloud.com/pastebin/ewbn3cSN/ [21:19:59] Analytics-Backlog, Analytics-Cluster: Capacity planning for cluster - https://phabricator.wikimedia.org/T116661#1755170 (Nuria) NEW [21:46:19] nuria: you around ? [21:49:05] mforns maybe ? 
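(Referring back to the UNION discussion between jdlrobson and milimetric above — the pastebin contents aren't preserved here, so this is a hedged sketch of the pattern being described, with placeholder table and column names rather than the actual MobileOptions schemas:)

```sql
SELECT LEFT(timestamp, 8) AS day, COUNT(*) AS events
FROM (
    SELECT timestamp FROM SchemaName_11111111   -- old schema revision (placeholder)
    UNION ALL
    SELECT timestamp FROM SchemaName_22222222   -- new schema revision (placeholder)
) AS temp   -- MySQL requires an alias on the derived table, the "temp" mentioned above
GROUP BY day;
```

Once events stop flowing into the old schema revision, the UNION ALL (and the old table) can be dropped, as milimetric notes in the code review.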
[21:49:15] yes :] [21:49:17] joal, [21:49:22] Heya :) [21:49:26] hey [21:49:33] I was wondering where you were with ES testing :) [21:49:54] mmm, we managed to get ES running in our instance [21:49:59] but no data yet [21:50:14] I am watching a video right now to help me [21:50:24] we tried docker without luck though [21:50:33] oh :( [21:50:38] we ended up executing naked ES [21:50:41] should work as well :) [21:50:45] ok [21:50:47] cool [21:51:46] thanks for the heads up :) [21:52:17] we were working in parallel (me with EL backfilling and Nuria speaking to jaime crespo) [21:52:27] right [21:52:36] so had no time to really dive into ES [21:52:46] but I'll work on this tomorrow morning [21:52:51] I'll have time tomorrow if you want, I might be able to help :) [21:52:57] ok [21:53:00] Let me know :) [21:53:08] I guess the principle is the same as cassandra no? [21:53:24] explode the cubes and store in key-value pairs [21:53:30] hm, not completely: ES provides aggregation functions [21:53:38] aaaahmmmm [21:53:51] ok [21:53:59] will look at that [22:09:06] Hm.. could someone look at this simple kafka consumer script to spot what could be wrong? It's not working as expected. https://gist.github.com/Krinkle/30feada545adfbe32478 [22:09:18] I'm looking to intercept a few statsv packets to debug an issue [22:09:32] primarily full url and user-agent [22:19:46] Passing SimpleConsumer(kafka, 'statsv-krinkle', 'statsv') results in kafka.common.OffsetOutOfRangeError: FetchResponse(topic='statsv', .. ) [22:20:31] Analytics-EventLogging, Editing-Department, Improving access, QuickSurveys, and 3 others: Schema changes - https://phabricator.wikimedia.org/T114164#1755460 (Jdlrobson) [22:21:27] madhuvishy: Is there a simple example of how to use kafka? (specifically SimpleConsumer, though another interface would be fine too). It seems the existing examples don't work, or for other reasons result in errors when invoked on stat1002 [22:24:38] milimetric: :) [22:27:57] Analytics-EventLogging, Editing-Department, Improving access, QuickSurveys, and 3 others: Schema changes - https://phabricator.wikimedia.org/T114164#1755492 (Jdlrobson) [22:28:04] Analytics, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1755493 (dr0ptp4kt) NEW [22:28:55] Krinkle: madhu's in India for two weeks so she won't be around normal hours [22:29:37] OK. I only asked because she wrote a lot of the documentation around it. Any Kafka users here? [22:30:07] This seems like beginner's 101. I just can't get it to work. I suspect some kind of wmf-specific quirk or some configuration error on stat1002 [22:31:01] Yes, I just use kafkacat usually, or the console output consumer. I'm out at dinner but I'll try to check it out later [22:33:17] * Krinkle looks into kafkacat [22:35:24] milimetric: Thx. kafkacat + grep works well. [22:37:14] cool, good. Krinkle: remember with kafkacat you're consuming usually from a specific partition of a topic (if you don't specify one, it uses one by default, I forget which one). So you're only getting 1/12th of the data that way (I think most topics have 12 partitions). [22:37:45] milimetric: Right. Is the distribution random-ish? [22:37:49] yes [22:37:58] OK. Then it's good enough for me. [22:38:09] I'll take samples from two or three different ones to be sure. [22:38:10] Thanks! [22:40:08] Krinkle: did you get help? [22:40:19] Krinkle: kafkacat is what we normally use [22:40:51] nuria: It doesn't appear to be documented though [22:41:01] Krinkle: kafkacat?
[22:41:17] Searching for the name kafkacat (now that I know about it) yields 2 casual mentions on unrelated pages on wikitech. [22:41:22] But nothing besides that [22:41:28] Krinkle: we did not make it https://github.com/edenhill/kafkacat [22:41:31] there's no way that from looking at documentation I would've found out to use that. [22:41:59] Krinkle: ah sorry, but it is just a debugging tool, like many others [22:42:08] nuria: I know, I'm not asking to document how it works, but that it exists and is the preferred tool / the tool that actually works / is installed/available/supported/recommended. [22:42:52] I'm still curious why the non-debugging tool (kafka simpleconsumer) doesn't work though. [22:43:10] which also isn't documented. And the example (analytics-statsv.py) doesn't appear to work. [22:44:05] Analytics, Reading-Admin, Zero: Country mapping routine for proxied requests - https://phabricator.wikimedia.org/T116678#1755593 (Yurik) Adding @bblack - as it might be Varnish that might need to be adjusted for proper traffic tagging if it comes via IORG [22:45:36] Krinkle: sorry but kafkacat is just a convenience, we do not particularly endorse it, we certainly could document it exists but it is not often that people want to consume from kafka directly other than inside our team, not really a use case we have had to date [22:47:34] Krinkle: maybe you are missing the port? [22:47:47] https://gist.github.com/nuria/01fef56a8a69528fee93 [22:48:00] I'm bringing it up because there appears to be a shift toward letting developers (outside analytics) "own" their own stuff and be less reliant on needing analytics engineers for day-to-day stuff (like adding new EL schemas and consuming the data programmatically etc.). I'm happy to do these things myself, but can't without there being some basic level of documentation about getting things to work. [22:48:23] Krinkle: understood, that seems totally fair [22:48:40] Krinkle: as i said it is not really common that people want to consume from kafka directly [22:48:48] I'm happy to wait a few days and file tasks instead, but it seems desirable to offload that from you guys and do it myself [22:48:50] Krinkle: let me add an entry to wikitech [22:48:55] I was mimicking this code btw https://github.com/wikimedia/analytics-statsv/blob/master/statsv.py#L80 [22:48:57] Krinkle: try my gist [22:52:09] nuria: [stat1002] python> ImportError: No module named avro.schema [22:52:23] ah sorry, remove avro plis [22:52:33] trying without [22:53:01] nuria: KafkaConsumer seems to work indeed. Thanks [22:53:26] I think it's just the port maybe? This code is simpler [22:56:42] Krinkle: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka#Produce.2FConsume_to_kafka [23:00:43] nuria: thx.
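For reference, a minimal sketch of the approach that ended up working here: kafka-python's higher-level KafkaConsumer with an explicit broker:port, rather than KafkaClient plus SimpleConsumer. The broker hostname, group id and offset settings below are placeholders rather than a copy of nuria's (elided) gist, and this assumes a reasonably recent kafka-python. Unlike the single-partition default mentioned for kafkacat above, KafkaConsumer subscribes to all partitions of the topic:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'statsv',                                  # the topic carrying statsv packets
        bootstrap_servers=['BROKER_HOST:9092'],    # placeholder broker; note the explicit port
        group_id='statsv-debug',                   # hypothetical consumer group name
        auto_offset_reset='latest',                # start at new messages rather than a stale offset
        consumer_timeout_ms=60000,                 # stop iterating after a minute without messages
    )

    for message in consumer:
        # Print raw payloads; filtering for a particular URL or user-agent
        # can be done here or by piping the output through grep.
        print(message.value.decode('utf-8', errors='replace'))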
[23:01:34] nuria: for future reference, you want <pre> instead of <code> for general blocks of code. <code> is inline (like bold) and doesn't work well across multiple lines. <syntaxhighlight> also works of course, if it is all in a common language that mw recognises.
[23:02:12] Krinkle: i added fancy syntaxhighlight with little hope that it would work .. and it did!
[23:02:20] 	 Yeah, that also works :)
[23:03:32] 	 I'm not sure I understand the 'cat' before 'kafkacat'. I used 'kafkacat' directly to consume data. I assume the pipe is to send data?
[23:07:33] 	 cat is to produce
[23:07:45] 	 cat "some file" | kafkacat -someparam
[23:07:56] 	 will send data to kafka
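A rough Python equivalent of that cat-into-kafkacat produce pipeline, for anyone who would rather stay in one tool: read lines from stdin and send each as a message. The broker and topic names are placeholders, and this again assumes a recent kafka-python (older releases exposed SimpleProducer instead of KafkaProducer):

    import sys
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=['BROKER_HOST:9092'])  # placeholder broker

    for line in sys.stdin:
        # One message per input line, mirroring `cat "some file" | kafkacat ...`
        producer.send('test-topic', line.rstrip('\n').encode('utf-8'))  # hypothetical topic

    producer.flush()  # block until all buffered messages have actually been sent

Invoked as `cat somefile | python produce.py`, it does the same job as the kafkacat one-liner above.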