[14:55:01] hi jynus
[14:55:03] let me see select max(timestamp) from log.MobileWikiAppSearch_10641988;
[14:55:07] ooh, nice quiet channel :)
[14:55:15] thx
[14:55:43] I get 20151026140855
[14:55:50] which is now - lag
[14:55:53] not 22
[14:55:59] on m4, right?
[14:56:03] yep
[14:56:10] what about on analytics-store?
[14:56:19] https://www.irccloud.com/pastebin/EhYfBU7W/
[14:56:19] that is analytics store
[14:56:29] I'm so confused :)
[14:56:38] let me check what you call analytics store
[14:56:43] m4-master.eqiad.wmnet is not the same as analytics-store.eqiad.wmnet in my mind
[14:57:00] m4-master is db1046
[14:57:08] analytics store is dbstore2002
[14:57:12] its slave
[14:58:07] right, so dbstore2002 has 20151022145533 as the max timestamp there (see my query output)
[14:58:24] not for me
[14:58:36] I got 20151026140855
[14:58:43] are you inside a transaction?
[14:59:15] no, that output is right after connecting to the box
[14:59:24] I've just run "select max(timestamp) from log.MobileWikiAppSearch_10641988" on the slave, not on m4-master, and I got that
[14:59:43] let me try again
[14:59:44] weeeird... hm, is there something magical happening when I access analytics-store from stat1003?
[14:59:50] mmm
[14:59:53] wait
[15:00:04] maybe the proxy is redirecting you to another host?
[15:00:08] let's try something
[15:00:14] SELECT @@server_id;
[15:00:15] the reason this matters is because 100% of the users of EL data run queries that way
[15:00:17] ^do that
[15:00:19] k
[15:00:37] 171978770
[15:00:59] ok, that is the wrong host
[15:01:10] let me see where you are querying
[15:01:16] mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -hanalytics-store.eqiad.wmnet
[15:01:45] the cnf file just has research/<> in it
[15:02:22] I have no idea where you are querying, let me do a search
[15:03:05] the command above ^ should connect to whatever stat1003 thinks is "analytics-store.eqiad.wmnet"
[15:03:38] ah, gotcha
[15:03:41] dbstore1002.eqiad.wmnet
[15:03:55] but I thought there was no replication running there
[15:03:58] let me see
[15:04:21] so first, this is unrelated to replication lag
[15:04:28] so, I'm not sure why "analytics-store.eqiad.wmnet" refers to two things, but that sounds confusing
[15:04:33] that is why I was being misled
[15:04:45] secondly, this dbstore1002 is where everyone uses EL data from
[15:04:48] I see
[15:06:00] it is strange, because there is no replication of m4 running there
[15:06:11] heh, that sounds like the problem :)
[15:06:56] ooh, even weirder: select max(timestamp) from Edit_13457736;
[15:07:05] that gets a recent time
[15:07:13] well, there is an easy fix, which is redirecting analytics-store to 2002
[15:07:43] mmmmm... that sounds like a very bad idea without first understanding what's going on here
[15:08:07] 1002 is where literally thousands of queries are run all the time, that's where all the researchers and dashboards and everything run
[15:08:09] probably the DNS was changed?
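A minimal sketch of the identity check being run above: resolve the alias, then ask the server itself who answered. The credentials file and hostnames are the ones quoted in the conversation; everything else is an assumption, not a reproduction of what was actually typed.

```bash
# Resolve the alias first, then ask the server which backend answered.
host analytics-store.eqiad.wmnet

# @@server_id (and @@hostname) identify the backend even if a proxy or
# CNAME sits in the middle.
mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf \
      -h analytics-store.eqiad.wmnet \
      -e "SELECT @@server_id, @@hostname;"

# Compare against a direct connection to the host you expected
# (dbstore2002 in jynus's mental model, dbstore1002 in practice).
mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf \
      -h dbstore1002.eqiad.wmnet \
      -e "SELECT @@server_id, @@hostname;"
```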
[15:08:25] and if 2002 is something else, not prepared for that, it might kill everything
[15:09:04] let me check DNS changes
[15:09:10] or the proxy
[15:17:32] there is no reference to m4 since 2015-08-28
[15:18:46] ok, so let me quickly generate a select max(timestamp) for each table on there and we can maybe see a bigger picture
[15:28:13] ok jynus, so of the 300+ tables, about 80 have max(timestamp) values that stopped at 20151022
[15:28:30] of those, the vast majority (all but 5 or 6) stopped sometime around 15:00:00
[15:28:34] strange, very strange
[15:28:54] this looks like more than a coincidence
[15:29:19] the thing is, not many people have RESET MASTER privileges
[15:29:39] just you, right? :)
[15:30:04] either someone did that, which requires those privileges, or it is a very bad bug
[15:30:20] we have seen really strange replication bugs in the past (Sean can attest to that)
[15:30:24] one thing I know is that I did not stop replication there, much less reset it
[15:30:31] not like this one
[15:30:33] oh I fully believe you
[15:30:36] not blaming you :)
[15:30:56] replication can break
[15:30:58] I'll make an incident report, is there more you can do to dig in? I think this is beyond me
[15:31:03] that is a non-issue
[15:31:11] but it cannot simply disappear
[15:31:42] well, replication to dbstore1002 is definitely happening, as we've been talking the max timestamps on the other tables have been increasing
[15:31:46] so that replication is set up somewhere
[15:33:58] hey I'll make a phab ticket for this too
[15:34:25] jynus: do you know if filipo knows about this? From sean?
[15:34:42] is springle there?
[15:35:03] jynus: let's write up what is going on in a ticket
[15:36:16] jynus: because it is a system that has been working for a couple of years and maybe recent changes have broken it; my idea is that we can get more eyes on the issue if we write it up in a ticket
[15:36:24] jynus: are you OK with that?
[15:36:34] jynus: should be "unbreak now" priority
[15:36:50] why filipo?
[15:37:15] create the ticket, I will look at it asap
[15:38:57] jynus: no idea, I thought sean was telling filipo stuff before you joined, but i could be totally misremembering
[15:39:52] I understand it is an issue, I just refuse to think things just disappear, and maybe it is just a naming problem
[15:41:48] jynus: what project should i tag it with
[15:41:57] Databases
[15:42:30] jynus: https://phabricator.wikimedia.org/T116599
[15:42:48] milimetric: i just created a ticket, see
[15:42:52] the thing is, dbstore2002 is precisely there for these cases (everything broken)
[15:43:21] so let me check if we have to go there or we can quickly fix the dbstore1002 issue
[15:44:17] nuria: argh :)
[15:44:19] I made the ticket
[15:44:24] and the incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20151022-EventLogging
[15:44:33] do not worry, will take care
[15:44:36] milimetric: SORRY,
[15:44:36] https://phabricator.wikimedia.org/T116600
[15:44:42] jynus: if you can merge both
[15:44:45] In one way or another
[15:44:45] no nuria, I should butt out, this is your week
[15:44:46] jynus: thanks
[15:44:47] my bad
[15:45:20] nuria: do you mind sending a note to analytics-l? People need to know they'll have to rerun reports for the past five days at least, once we push this data to dbstore1002
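A sketch of how the per-table survey mentioned at 15:18 could be generated. It assumes the same research-client.cnf credentials quoted earlier and that every EventLogging table has a timestamp column (they do by convention); the exact script milimetric used is not shown in the log.

```bash
# List every table in the `log` database with its max(timestamp), sorted by
# timestamp, so the ~80 tables that stalled on 20151022 cluster together.
CNF=/etc/mysql/conf.d/research-client.cnf
HOST=analytics-store.eqiad.wmnet

mysql --defaults-file="$CNF" -h "$HOST" -N -e "SHOW TABLES FROM log;" |
while read -r t; do
    ts=$(mysql --defaults-file="$CNF" -h "$HOST" -N \
               -e "SELECT MAX(timestamp) FROM log.\`$t\`;")
    printf '%s\t%s\n' "$t" "$ts"
done | sort -t $'\t' -k2
```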
[15:45:29] milimetric: will do
[15:46:04] also, not only did the data fail, monitoring did too, which is very unlikely
[15:53:19] I have a meeting, will continue with this in 1 hour
[17:10:51] milimetric, this does not need an incident report
[17:10:59] it is not an outage
[17:11:16] that doesn't mean it is not important
[18:05:49] jynus: we consider any drop of EL data an outage
[18:06:16] jynus: as the whole wmf product organization needs that data
[18:06:36] jynus: this is the protocol we agreed on with ops (and springle) a while back
[18:06:49] I am ok with that
[18:07:00] (although I didn't know it)
[18:07:08] jynus: sorry about that
[18:07:15] jynus: are you in analytics@
[18:07:18] but I need someone here to ask questions
[18:07:20] live
[18:07:26] jynus: that would be me!!
[18:07:41] if it is an unbreak now that should be feasible
[18:07:45] someone had questions about the EL dbs?
[18:07:50] ah, there ya are
[18:08:31] so, you confirm that you could find fresh data until last week?
[18:08:36] jynus: yes
[18:08:46] because I can confirm that replication hasn't worked for months
[18:09:11] jynus, yes, there is fresh data as of the 22nd
[18:09:18] although I can see queries to dbstore1002 recently
[18:09:25] you have multiple people saying this multiple times, including providing the queries to the db in the initial bug report
[18:09:31] I promise that there is data from 22 October in there ;p
[18:09:39] no, I know that
[18:09:46] jynus: what about data from the 22nd?
[18:09:58] the thing is, replication has not been working for a long time
[18:10:13] something has been inserting data there outside of replication
[18:10:13] jynus: if replication is broken how did it get there? (asking honestly, i have no clue)
[18:10:24] jynus: ahhhh, that can be
[18:10:40] jynus: maybe springle
[18:10:46] set it up that way?
[18:10:49] and it may be a reason why dbstore is not read_only
[18:10:54] TIL this channel -- jynus, whether i can help depends on the issue complexity or whether what broke is network-wise, but I will try if I can get an idea of what you need
[18:11:09] chasemp, I need something different from you
[18:11:16] as a temporary measure
[18:11:22] can you punch a hole in the vlans
[18:11:34] from stats1003.eqiad
[18:11:48] to dbstore2002.codfw
[18:11:52] on 3306
[18:12:15] that way people will be able to work until we find something more permanent
[18:13:02] that is 10.64.36.103 to 10.192.32.19
[18:13:19] jynus: ping me if you need anything, i will be here all PST day
[18:13:29] ok, I will
[18:13:35] let me look at whether I understand the how and can make sense of it; your thought is that this resolves https://phabricator.wikimedia.org/T116599#1754253
[18:14:15] there is a problem with dbstore1002 lacking information, and it is not trivial to see why
[18:14:29] we have another slave in codfw with all the data
[18:14:41] that could do the job
[18:14:57] this is analytics, research and mobile traffic, not wiki
[18:15:31] is dbstore2002 just the codfw equivalent of dbstore1002?
[18:15:32] it doesn't resolve the issue, but it allows people to work until I fix it permanently
[18:15:39] Krenair, yes
[18:15:44] analytics-store.eqiad.wmnet is an alias for dbstore1002.eqiad.wmnet.
[18:15:51] Host analytics-store.codfw.wmnet not found: 3(NXDOMAIN)
[18:15:56] but there is no access from stat3
[18:16:11] does what I want to do make sense?
[18:16:12] which is what you're trying to get fixed right? :)
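Once the temporary network hole from stat1003 to dbstore2002 is in place, a check along these lines would confirm it works end to end. Illustrative only: the IP and port are the ones chasemp was given, and the credentials file is an assumption.

```bash
# TCP-level reachability from stat1003 to dbstore2002 on the MySQL port.
nc -vz -w 5 10.192.32.19 3306

# Application-level check: connect, confirm the host identity, and confirm
# that the EventLogging data there is fresh.
mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf \
      -h dbstore2002.codfw.wmnet \
      -e "SELECT @@hostname, MAX(timestamp) FROM log.MobileWikiAppSearch_10641988;"
```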
[18:16:15] I think so
[18:16:16] right
[18:16:43] access is allowed on iptables, so it must be a network problem (which is expected)
[18:18:28] regarding the actual issue - I can recreate the replication, but I need the point in time at which replication stopped
[18:18:57] but I do not think this is replication-related - I think the client used to update both servers at the same time, and it stopped doing so
[18:19:14] whatever changed, it changed last Thursday apparently
[18:19:21] sadly, I do not have a binary log to prove it
[18:19:40] Nuria, you mentioned "stopping the backfilling"
[18:20:02] what did you mean by that, what exactly did you do?
[18:20:04] jynus: yes, that is a python process
[18:20:13] jynus: running on an eventlogging node
[18:20:45] so, you killed a process there, do you do that often or just this time (maybe 2 processes run usually?)
[18:20:56] I am speculating here
[18:21:22] jynus: that is just a python process we start when we need to add data to the database from logs
[18:21:32] jynus: from an eventlogging node.
[18:21:33] can you point me to it
[18:21:41] as in which server, which process
[18:21:50] jynus: but no one on my team has permissions to modify database configuration
[18:21:56] yes
[18:22:08] that is what I think, I think this is not a replication issue
[18:22:20] jynus: it's an eventlogging consumer, not a db process, see:
[18:22:20] I think a client process was run duplicating the inserts on both nodes
[18:22:49] which would make sense
[18:23:35] or maybe a proxy issue
[18:23:52] jynus: not a client process from eventlogging, that is a service that doesn't know anything about replication
[18:23:57] jynus: box: deployment-eventlogging03.eqiad.wmflabs
[18:24:09] jynus: you can do ps -auxfw to see processes
[18:24:30] jynus: db config is on /etc/eventlogging
[18:25:13] etc/eventlogging.d/consumers
[18:25:34] as you can see it only knows about the master, this is not db configuration
[18:27:35] jynus: so, no, not related to this, as i said eventlogging is a user of the database
[18:27:58] so here is the thing
[18:28:12] I can start replication (actually, configure it for the first time)
[18:28:28] but I cannot guarantee the integrity of the data
[18:28:47] because I do not know where to start replicating
[18:28:52] jynus: i would rather we try to fix the process that was loading this data before the 22nd
[18:29:22] jynus: if possible
[18:29:34] but the data, afaik, is ok on m4-master and dbstore2002, isn't it?
[18:30:08] jynus: ok now try
[18:30:42] as far as i know it is on db1046, which is m4; i do not know what dbstore2002 is
[18:30:50] chasemp, thank you thank you thank you
[18:31:14] it is another slave I prefer to give you access to, because it has not been "broken"
[18:31:24] it works
[18:31:40] we can even change the DNS to point there automatically
[18:32:10] at least until we can either reconstruct or have a good understanding of what happened
[18:35:56] jynus: but then the processes running on 1002 that were getting data from the slave, will they work?
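Two checks corresponding to what is being debated here, sketched under assumptions: confirming on dbstore1002 that no m4 replication stream is configured (the SHOW ALL SLAVES STATUS form assumes MariaDB), and looking at the EventLogging consumer processes and config nuria points to.

```bash
# On dbstore1002: is any replication stream configured at all, and from where?
mysql -h dbstore1002.eqiad.wmnet -e "SHOW ALL SLAVES STATUS\G" |
    grep -E 'Connection_name|Master_Host|Slave_IO_Running|Slave_SQL_Running'

# On the EventLogging host: which consumers run, and what do they write to?
ps auxfw | grep -i '[e]ventlogging'
grep -r . /etc/eventlogging.d/consumers/   # per the chat, only m4-master appears here
```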
[18:36:10] jynus: that is the main thing broken
[18:37:11] jynus: I believe every dashboard and process on 1002 is connecting using this configuration (managed by puppet)
[18:37:12] mysql --defaults-extra-file=/etc/mysql/conf.d/analytics-research-client.cnf -hanalytics-store.eqiad.wmnet
[18:37:17] on 1002
[18:41:44] analytics-store.eqiad.wmnet is dbstore1002
[18:42:09] we will have to change it temporarily to dbstore2002.codfw.wmnet
[18:46:08] jynus: I haven't read everything above but switching dbstore2002 and dbstore1002 needs to be thought about carefully. Are we sure that both are exactly the same and that dbstore2002 has the other databases that are on dbstore1002, like "staging"? That's a crucial db for most of the analytics work, as people stage data in there and crunch it further
[18:46:30] also, not sure if whatever's inserting data into 1002 is filtering anything private that shouldn't make it across
[18:46:46] for now, it's fine if 1002 is just outdated, that doesn't need "unbreak now" priority
[18:47:02] people can re-do reports and be a week or more late, that's nothing new at WMF
[18:47:04] milimetric, there is that, or having a database that may or may not have all the data
[18:47:05] * Ironholds blinks
[18:47:10] milimetric, okay, now take that statement
[18:47:15] and make it to the Vice-President of Discovery
[18:47:20] and let's see if he agrees with you
[18:47:30] these are not ad-hoc reports, these are regular and consistent dashboards.
[18:47:40] This is also a block on ad-hoc reports, of course, like /all of our A/B tests/.
[18:47:42] access should be granted anyway
[18:47:46] Ironholds: fair enough, but I can present the alternative - open an unknown security hole and remove access to existing databases like "staging", which breaks a bunch of other queries
[18:48:09] at least people have the option to use it
[18:48:13] Ironholds: I'm happy to talk to the VP of Discovery and explain that we need more people helping jynus so he's not dealing with work that should be done by 10 people :)
[18:48:19] solving the underlying bug is also an alternative. Don't get me wrong, I'm not saying "it must be done Right Now"
[18:48:29] I appreciate bug finding and bug fixing take time
[18:48:35] Ironholds, I do not know what the bug is
[18:48:38] but I'd like that finding and fixing to be prioritised
[18:48:39] indeedy
[18:48:43] mmm, not sure about that jynus, it's not as clear-cut. Until we figure out how exactly the data was moving from one place to the other, I wouldn't make assumptions
[18:49:15] replication has not worked there for months; if something was inserting the data, I do not know anything about it
[18:50:40] well, presumably replication was removed from the code by someone
[18:50:49] jynus: that's exactly my point, we need to find how that data was being inserted, in case it's not simple replication
[18:50:54] If so, could we not reach out to them and see if they did that in order to switch copying over to another service?
[18:51:40] maybe springle knew about something outside of replication, but beyond that I am blind
[18:52:05] especially on that host, which for performance reasons has all logging disabled
[18:52:11] is he available via email? I think he could clear this up most likely
[18:52:21] it seems like something that was missed in the transition, which happens a lot here
[18:55:41] btw, dbstore1002 has data up to the minute for the other 200+ tables, so poking a hole in the firewall doesn't sound like a good approach.
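Before repointing analytics-store at dbstore2002, the concerns raised here (missing databases such as "staging", possibly different data) could be checked with something like the following. Hosts are the ones from the chat; the credentials file is an assumption.

```bash
CNF=/etc/mysql/conf.d/analytics-research-client.cnf

for h in dbstore1002.eqiad.wmnet dbstore2002.codfw.wmnet; do
    # Database inventory on each store, e.g. does "staging" exist on both?
    mysql --defaults-extra-file="$CNF" -h "$h" -N -e "SHOW DATABASES;" \
        | sort > "/tmp/dbs.$h"
    # Freshness spot check on a table known to be current on 1002.
    echo -n "$h: "
    mysql --defaults-extra-file="$CNF" -h "$h" -N \
          -e "SELECT MAX(timestamp) FROM log.Edit_13457736;"
done

diff /tmp/dbs.dbstore1002.eqiad.wmnet /tmp/dbs.dbstore2002.codfw.wmnet
```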
[18:56:01] the setup is the way it is for a reason, I remember sean talking about it. we should find that out first.
[18:56:51] yeah, it's totally fine to wait until australia wakes up (or...goes to sleep and wakes up? I forget how timezones work)
[18:57:01] apologies for being a grumpy customer
[18:57:03] dbstore1002 has data up to the minute for the other 200+ tables
[18:57:08] ^that breaks me
[18:57:15] :) hahaha, I'm with you jynus
[18:57:24] I'm feeling your pain man
[18:57:27] and, in a way
[18:57:45] makes it mostly "not my problem" - let me clarify
[18:57:56] it is then not a replication problem
[18:58:06] there is client code updating this
[18:58:16] which confirms my suspicion
[18:58:28] now we have to discover where :-)
[18:58:33] jynus: if it's client code, we (analytics) haven't written it
[18:58:45] no, probably it is sean's
[18:58:47] jynus: when you say client, you mean what?
[18:58:55] like insert statements
[18:59:02] jynus: what type of client? we can help you look for it
[18:59:03] instead of binlog replication
[18:59:10] it is not updates sent through regular replication
[18:59:22] something connects to the database and replays the inserts
[18:59:37] once on the master, a second time on the slave
[18:59:57] let me debug the slave to see where they come from
[19:00:01] jynus: ok, this code has to exist somewhere
[19:00:11] yes :-)
[19:00:20] jynus: how would it look?
[19:00:28] jynus: must be in a repo
[19:00:56] check the client connections and see where they connect from
[19:01:12] if it is being done partially now
[19:02:25] jynus: client connections point to m4
[19:02:42] jynus: from the point of view of "inserts" of outside data
[19:04:42] ok, I see a write now
[19:05:25] on Edit_13457736
[19:06:02] from terbium.eqiad.wmnet
[19:07:18] jynus: will a binary log of the data that went into the master help you?
[19:07:42] not really
[19:08:32] I think I got it
[19:08:40] at least it has a promising name
[19:08:49] /bin/bash ./eventlogging_sync.sh
[19:08:54] on terbium
[19:08:59] first time I've seen it
[19:09:16] jynus: i have a bunch of e-mails from sean about replication
[19:09:22] mysqldump -h m4-master
[19:09:32] jynus: so if a custom process was set up, it must have been after may
[19:09:34] it is dumping and writing data from the master
[19:09:47] jynus: ahhh
[19:10:06] jynus: on what machine?
[19:10:12] terbium
[19:10:25] jynus: i probably do not have permissions
[19:10:41] it will take me a bit to understand it, but now that I know what happens, I can fix it
[19:10:54] ha! i do
[19:10:57] hi
[19:11:01] ori: do you know of any custom code to handle replication between master and slave in the eventlogging db (we are talking with jynus in #wikimedia-databases)
[19:11:02] yes
[19:11:03] probably if the alters blocked the tables for some time
[19:11:10] ori, too late
[19:11:10] springle had something custom set up
[19:11:13] :-)
[19:11:17] oh ok :)
[19:11:20] I discovered it independently
[19:11:25] but please tell us
[19:11:27] about it
[19:11:40] i don't remember the details but i think i have springle's description of it somewhere in my irc logs
[19:11:58] it is ok, I discovered it on terbium
[19:12:00] The stuff he ran from his user on terbium?
[19:12:04] ok :P
[19:12:32] ...you know what?
[19:12:43] I'm never apologising for how fragile the analytics dashboards for Discovery are ever again
[19:12:45] :P
[19:12:52] hahaha
[19:13:33] the thing is, I remember him mentioning that he wanted to change the system, I just did not remember that he had actually implemented it
[19:14:07] jynus: nice!
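The "check the client connections" step jynus describes amounts to watching the slave's processlist for incoming INSERTs. A hedged sketch, assuming an account with the PROCESS privilege on dbstore1002; this is the kind of view in which the writes from terbium.eqiad.wmnet showed up.

```bash
# Show current connections that are executing INSERTs, with the client host.
mysql -h dbstore1002.eqiad.wmnet -e "
    SELECT id, user, host, db, time, LEFT(info, 80) AS query
    FROM information_schema.processlist
    WHERE info LIKE 'INSERT%';"
# Running this repeatedly exposes periodic writers; in this case a write to
# log.Edit_13457736 arriving from terbium.eqiad.wmnet.
```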
[19:14:15] there it is on /home/springle
[19:14:19] so
[19:14:27] jynus, nuria: https://dpaste.de/5nPO/raw
[19:14:30] probably it is on the way to being fixed already
[19:14:32] some context there
[19:15:51] all right, thanks ori, but "shiny start of the week" goes to jynus
[19:16:02] it has been commented out
[19:16:06] from cron
[19:16:27] maybe someone else is to blame here :-)
[19:17:19] or it went crazy on an alter table
[19:17:35] or more probably
[19:17:40] on alter table
[19:17:43] afaik springle was experimenting with this not due to an existing issue with replication (he mentions that codfw is still standard) but because he was trying to be forward-looking and work out a setup that would continue to scale without needing serious interventions
[19:17:44] the import fails
[19:17:53] it'd be worthwhile to sync up with him about that if you get a chance
[19:18:01] no, I already know the why
[19:18:02] he also had some good initial results from tokudb i think
[19:18:15] I didn't know it was in full production
[19:18:19] * ori nods
[19:18:19] :-)
[19:18:29] I actually agree with it
[19:18:54] the problem is that it is so non-standard that it needs flashing lights for me to notice it
[19:19:23] jynus: indeed, puffff
[19:20:14] maybe the only tables that are desynced are the ones that were altered?
[19:20:19] that may be a reason
[19:20:27] for the import to fail
[19:20:29] how about http://media0.giphy.com/media/3o85xGdyQjVT2ymANy/giphy.gif ?
[19:20:40] re: flashing lights :P
[19:20:57] ori, as an epileptic, not cool :-(
[19:20:57] jynus: ok, i will not look at this closely anymore as it looks like we now know the "how" and likely the "why"; if you can update the ticket when you have an ETA for fixing it (can be tomorrow, it is late in Europe!) that will be it
[19:21:15] jynus: many many thanks for the prompt response
[19:21:16] jynus: ohhhhhhh shit. i'm so sorry.
[19:21:22] i'm an idiot.
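The actual eventlogging_sync.sh lives in /home/springle on terbium and is described in ori's dpaste; the log never shows its contents, so the following is only a rough illustration of the pattern being discussed — periodically mysqldump recent rows from m4-master and replay them on the analytics slave, outside binlog replication. Every detail below (interval, options, scope) is a guess, not the actual script.

```bash
#!/bin/bash
# Illustrative approximation only, not the real eventlogging_sync.sh.
MASTER=m4-master.eqiad.wmnet
SLAVE=dbstore1002.eqiad.wmnet

while true; do
    for t in $(mysql -h "$MASTER" -N -e "SHOW TABLES FROM log;"); do
        # The real script presumably restricts the dump to recent rows
        # (e.g. with --where on timestamp) rather than full tables.
        mysqldump -h "$MASTER" --single-transaction --no-create-info \
                  --insert-ignore --skip-add-locks log "$t" |
            mysql -h "$SLAVE" log || break   # an ALTER-induced failure here
                                             # would stall everything after it
    done
    sleep 300
done
```

This also matches the observed failure mode: once one table's import broke (e.g. after a schema change), everything alphabetically after it stopped being refreshed.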
[19:21:31] ori, no problem, I'm not photosensitive :-)
[19:22:01] nuria, it may take me some time, but I would expect to have it tomorrow
[19:22:31] thanks to everybody for the help
[19:23:03] * ori PARTs, too many channels
[19:58:20] all i see is flashing skeletons
[19:58:27] * milimetric deletes that from history
[19:58:54] thanks very much, Jaime
[20:00:04] to be fair, I caused this with the schema change - but I didn't have a clue about this process running on terbium
[20:00:29] I didn't even know dbstore1002 was a slave you used
[20:00:40] that's more than fair :)
[20:00:47] we should meet at some time and discuss what you do and what you need
[20:00:47] all's well that ends well :)
[20:00:59] the list of affected tables is pasted into the incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/20151022-EventLogging
[20:01:16] I'll comment on the ticket so you can check against it, if it's the ones you altered, then you'll have more info
[20:01:18] I will fill that in
[20:01:55] but the issue was not the alter, but the custom replication - I thought it was a test, but it is fully in production
[20:02:05] right, "production" :)
[20:02:08] we need to puppetize that
[20:02:26] at least so it is documented in code
[20:02:50] yes, I find searching the puppet repo is incredibly illuminating
[20:03:11] so the more things puppet knows about, the better
[20:03:15] well, better than running in sean's screen session
[20:03:19] :-)
[20:03:32] yes
[20:03:41] the same would happen if terbium was restarted
[20:04:08] problem is I focused on replication - and that was not a REAL slave
[20:04:24] sorry for not believing you :)
[20:04:32] you were right, replication was totally fine
[20:05:22] to be fair, Sean's solution is good
[20:05:48] we just need to make it better documented and known to everybody
[20:06:03] you depend on me for basic things, I need to know what those are
[20:07:42] also, the impact was lower, just 7-10 tables
[20:09:08] hm, seems too coincidental that those other tables have a max around 15:00:00 on the same day
[20:09:13] also, we should have monitoring on this
[20:09:25] but we can look more tomorrow - you need to sleep and other human things :)
[20:09:44] I will keep investigating, but I will wait for the process to backfill normally first
[20:09:46] btw, whenever you want, we've been talking pretty heatedly about the analytics query service and the datastores we want to use
[20:09:59] Druid is looking better and better but I'd love to pick your brain about it whenever you have time
[20:10:08] I will be glad to help, I do not care much about the what
[20:10:20] ok, let me know when you have time
[20:10:25] but if I will help maintain it, I want to know in advance :-)
[20:10:37] oh, no worries on maintaining this, we can take care of it
[20:10:58] we were just curious what your opinion was of the tool
[20:11:22] if you're curious until then, I loaded 1 day of data at hourly granularity into an instance on labs
[20:11:40] we are also looking at alternatives for metrics storage
[20:11:50] this UI is alpha (we wouldn't use it in prod) but you can play with it: http://druid.wmflabs.org/pivot
[20:12:09] cool, yeah, we should talk soon then
[20:12:25] if you're free, just let me know and I'll invite you to a meeting this week
[20:12:49] invite me, I just cannot promise availability
[20:13:08] you are one of the many "users" I have to attend to :-)
[20:13:23] (an important one, though)
[20:13:50] ok, cool, I'll just add you whenever we're talking about it next
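A sketch of the monitoring jynus asks for at 20:09: flag any EventLogging table whose newest row is older than a threshold. Purely illustrative; the threshold, host, and the Nagios/Icinga-style exit codes are assumptions, and low-traffic tables would need a higher threshold or an exclusion list. Timestamps compare lexicographically because of the YYYYMMDDHHMMSS format.

```bash
#!/bin/bash
# Exit non-zero if any log.* table looks stale.
HOST=analytics-store.eqiad.wmnet
MAX_LAG_HOURS=6
cutoff=$(date -u -d "-${MAX_LAG_HOURS} hours" +%Y%m%d%H%M%S)

stale=$(mysql -h "$HOST" -N -e "SHOW TABLES FROM log;" |
while read -r t; do
    ts=$(mysql -h "$HOST" -N -e "SELECT MAX(timestamp) FROM log.\`$t\`;")
    # YYYYMMDDHHMMSS strings sort the same way the times do.
    [[ "$ts" < "$cutoff" ]] && printf '%s %s\n' "$t" "$ts"
done)

if [[ -n "$stale" ]]; then
    echo "CRITICAL: stale EventLogging tables:"
    echo "$stale"
    exit 2
fi
echo "OK: all EventLogging tables updated within ${MAX_LAG_HOURS}h"
```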
[20:14:00] no worries if you can't make it
[20:16:29] (I am talking to myself)
[20:17:10] actually no, there were more tables involved, all after (and including) db1047 GatherClicks_12114785
[20:17:21] it failed completely after the first error
[20:17:35] but it is automatically backfilling all of them
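A quick way to watch the automatic backfill jynus mentions: re-run the max(timestamp) survey, but only for tables at or after GatherClicks_12114785 in alphabetical order (the point where the import started failing), and check that the values advance. Host and credentials are assumptions.

```bash
HOST=analytics-store.eqiad.wmnet
mysql -h "$HOST" -N -e "SHOW TABLES FROM log;" |
awk '$0 >= "GatherClicks_12114785"' |
while read -r t; do
    printf '%s\t' "$t"
    mysql -h "$HOST" -N -e "SELECT MAX(timestamp) FROM log.\`$t\`;"
done
```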