[00:02:14] homebrew
[00:02:15] (CR) Jforrester: [C: 2] Add i18n for API module help [extensions/ContributionTracking] - https://gerrit.wikimedia.org/r/169811 (owner: Anomie)
[00:02:29] (Merged) jenkins-bot: Add i18n for API module help [extensions/ContributionTracking] - https://gerrit.wikimedia.org/r/169811 (owner: Anomie)
[00:02:32] ok found my.cnf, and it's empty
[00:03:07] ok, just add these two lines (adjust paths as needed)
[00:03:31] general_log_file = /var/log/mysql/mysql.log
[00:03:31] general_log = 1
[00:03:52] then restart mysql
[00:13:00] so I keep having this error
[00:13:08] when I close mysql then restart
[00:13:22] it doesn't update "the PID file"
[00:13:36] which is something awight and I were troubleshooting
[00:13:40] hmm. permissions?
[00:13:53] permissions with?
[00:14:01] do you do things with sudo on mac?
[00:14:10] I do
[00:14:25] so do the permissions change when you shut down mysql?
[00:15:14] I used to just be able to go mysql.server start
[00:15:16] just thinking that trying to start it again without sudo would have permissions issues on the pid file
[00:15:17] and everything worked
[00:15:21] but now I get this error
[00:15:31] ah ok, I'll try that
[00:15:59] hmm nope
[00:16:10] huh. if you put # signs in front of those two lines in my.cnf, does it start up?
[00:18:03] how are you shutting down mysql?
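For reference, the two lines suggested above need to live under an option group to parse; a minimal sketch of the resulting my.cnf (the `[mysqld]` group header is an assumption not stated in the chat, and a Homebrew install typically keeps my.cnf under /usr/local/etc rather than /etc):

```ini
# Minimal my.cnf enabling the MySQL general query log, per the chat above.
# NOTE: the [mysqld] group header is assumed; option lines placed in an
# empty file with no group cause a "Found option without preceding group"
# startup error, which could itself explain a failure to start.
[mysqld]
general_log      = 1
general_log_file = /var/log/mysql/mysql.log
```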
[00:18:06] the specific PID file thing is probably caused by the lack of a pidfile path in mysqld.conf
[00:18:25] nope :(
[00:18:26] for now, shutdown is like "killall -9 safe_mysqld" :)
[00:18:58] awight when I do that it says no matching processes were found
[00:19:04] last time, we had to ps auxxww|grep mysqld and snipe the processes manually
[00:19:05] and when I get the individual process,
[00:19:11] lemme see what the exact names are
[00:19:13] I get the process id and kill -9
[00:19:21] and it says process not found
[00:19:26] then I try to look again
[00:19:28] hrm
[00:19:31] and the id has changed
[00:19:31] killall -9 mysqld_safe ; killall -9 mysqld
[00:19:34] werry odd
[00:19:36] yep.
[00:19:40] it changes every time hahah
[00:20:03] the parent respawns its child, so you have to kill that first. pleasant unix 201 exercise
[00:20:15] Actually, I think unix is graded in circles of hell
[00:20:22] purgatory7
[00:20:30] yeah awight both of those produce the same result
[00:20:37] which is that it's saying those processes are not found
[00:20:43] with and without sudo
[00:20:54] how do I kill the parent?
[00:21:08] in that case, run the "ps auxxww|grep mysqld" command
[00:21:10] does mysql.server stop or mysql.server restart do anything?
[00:21:26] find the mysqld_safe process id and kill that first. then kill the mysqld process
[00:21:36] ejegg: nah, they require the PID to do their job
[00:21:48] how do I find the mysqld_safe process id?
[00:21:51] I guess, since the mysqld.conf file is empty, there is no pid
[00:21:55] those suggestions didn't work
[00:21:56] pizzzacat: you run the ps...grep
[00:22:01] I did that
[00:22:03] what's the output?
[00:22:18] ps auxxww|grep /usr/local/Cellar/mysql/5.6.17_1/bin/mysqld
[00:22:19] sherahsmith 24713 0.0 0.0 2423368 188 s009 R+ 4:21PM 0:00.00 grep --color=auto /usr/local/Cellar/mysql/5.6.17_1/bin/mysqld
[00:22:26] it worked, then!
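The parent-first kill sequence discussed above can be sketched as shell. Only the PID extraction is actually exercised here, on a canned ps line built from the chat (the kill commands themselves are shown as comments since there is no live mysqld to kill):

```shell
# mysqld_safe (the parent) respawns mysqld, so the parent must die first:
#   kill -9 "$(pgrep -f mysqld_safe)"
#   kill -9 "$(pgrep -x mysqld)"
# Extracting the PID (column 2) from ps output, demonstrated on a canned line:
sample='sherahsmith 24713 0.0 0.0 2423368 188 s009 R+ 4:21PM 0:00.00 /usr/local/Cellar/mysql/5.6.17_1/bin/mysqld'
pid=$(printf '%s\n' "$sample" | awk '{print $2}')
echo "pid to kill: $pid"
```

Note that the ps output pasted in the chat is only the grep process matching its own command line, which is why "no matching processes were found": `grep '[m]ysqld'` or `pgrep` avoids that false positive.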
[00:22:41] now, apply 12V to
[00:22:48] mysql.service start
[00:22:48] mysql.server start
[00:22:48] Starting MySQL
[00:22:48] . ERROR! The server quit without updating PID file (/usr/local/var/mysql/sherah-smith-2.corp.wikimedia.org.pid).
[00:22:54] meh
[00:23:15] I'm going to killall
[00:23:22] * awight ducks under the desk
[00:23:24] -9 manhattans
[00:23:31] you already did kill some more
[00:23:49] unless ps shows that jumpstarting the service worked?
[00:24:00] * awight grabs some more popcorn
[00:24:04] en ingles por favor
[00:24:29] imagine that you only have 4 tools that work
[00:24:37] generous
[00:24:50] yes this is no Gilligan's iland shite
[00:25:02] what do you mean "ps shows that jumpstarting the service worked"?
[00:25:12] searchlight: ps auxxww|grep mysqld
[00:25:22] hammer: kill -9 NUM
[00:25:41] sherahsmith 24830 0.0 0.0 2423368 188 s009 R+ 4:25PM 0:00.00 grep --color=auto mysqld
[00:25:41] battery wires attached to each arm: mysql.server start
[00:25:53] that should totally be enuf to get your server going again
[00:26:10] move that mysqld.conf file to another name, for now
[00:26:13] um but that's what I'm saying doesn't work
[00:26:24] ok I'm confused
[00:26:51] let's back up to "ps shows that jumpstarting the service worked"
[00:27:01] and only use commands
[00:27:07] so I know exactly what to type
[00:27:09] your ps shows that there's no server running
[00:27:12] goot
[00:27:13] because sometimes I can't tell what's commentary
[00:27:18] and what's commands
[00:27:25] sorry :(
[00:27:27] then u try to start the mysql.server. fail?
[00:27:46] umm
[00:27:46] perhaps it's failing because your conf file is bad. try renaming that file.
[00:27:52] wait!!
wait wait
[00:27:59] hehe sorry, it's hard to type as I eat this popcorn
[00:28:02] I can't tell what you're saying awight
[00:28:15] so here's where I'm at
[00:28:18] k
[00:28:30] I got the error
[00:28:41] saying the server quit without updating the PID file
[00:28:45] then I typed ps auxxww|grep mysqld
[00:28:56] and no process was running...
[00:28:58] then I got sherahsmith 24830 0.0 0.0 2423368 188 s009 R+ 4:25PM 0:00.00 grep --color=auto mysqld
[00:29:06] so what is my next step?
[00:29:07] cool.
[00:29:17] try renaming the config file
[00:29:39] ok which config file?
[00:31:03] awight ^
[00:31:19] I don't think u told us the exact path, but somewhere under /usr/local/Cellar/mysql/5.6.17_1
[00:31:46] oh oh ok
[00:31:59] rename it just to get it hidden?
[00:32:02] or to something specific?
[00:32:24] yeah just to hide it
[00:32:33] then, try starting the server again
[00:32:45] ok!
[00:32:56] nope same error
[00:33:10] blahhh mysql
[00:33:46] and ps... says no server is running?
[00:33:51] pizzzacat1: ^
[00:35:15] while you're having fun, try tail /usr/local/var/mysql/Sherah-Smith-2.local.err
[00:35:38] and even, ll /usr/local/var/mysql/Sherah-Smith-2.local.err
[00:37:24] ps says sherahsmith 25042 0.0 0.0 2432784 516 s009 S+ 4:37PM 0:00.00 grep --color=auto mysqld awight
[00:38:05] ewulczyn: I wish we had the graphix from that notebook you emailed
[00:38:58] awight: do you want the plots separately?
[00:39:01] ewulczyn: however, there's this nasty aliasing effect we are fixing soon, where allocations only come in increments of 100%/30
[00:39:14] meh whatever, if it happens at a round hour, I'm pretty sure I know what's happening
[00:41:33] My guess is that another campaign becomes active, and it causes the allocations to change slightly. Because of the 3.3% rounding effect, sometimes a banner will be robbed of some likelihood. you'd have to look at the allocations during that test, and see whether it's true, though.
[00:41:38] ewulczyn: ^
[00:41:51] pizzzacat: anything in the error logs?
[00:42:26] pizzzacat: also, I would suggest running mysql.service with its -v or --debug options, but I don't see documentation about that, anywhere. Where does this janky brew package come from?
[00:43:52] ewulczyn: hmm, but that's much more than a 3.3% difference...
[00:44:22] aargh, apparently you can't add order_id to staged_vars without some nasty side effects. Thank goodness for these tests.
[00:44:26] ah no it still might be the case, because the worst case situation is actually 3.3% vs 6.6%, between a/b
[00:44:32] ewulczyn: nooo
[00:44:34] sorry
[00:44:38] ejegg: noooo!
[00:44:59] srsly...
[00:45:13] NOOO
[00:45:22] order_id is totally off in its own evil bubble
[00:45:32] maybe.... good magic
[00:45:35] but definitely unique magic
[00:47:36] ewulczyn: I don't understand why your analysis is blocked, though--can't you just normalize out # of impressions?
[00:48:00] it's the "combine" you can't do?
[00:49:03] well, I can override normalizeOrderID and regenerateOrderID to set those transaction-specific values after calling parent. Blecch.
[00:49:17] awight: the analysis is not blocked. The last comment was directed at Megan to warn her against combining the data. I am confused about how the campaigns were interacting in the way shown
[00:51:41] ewulczyn: it's terrible, but imagine you have four banners equally sharing allocations.
[00:51:59] Because we quantize to 30 allocation "slots", you cannot have four exactly equal banners.
[00:52:27] Instead, two will be at 7/30 and two at 8/30 of the allocation
[00:53:05] I think that's what happens at day 3 of your test--some other campaign starts to overlap with this test, and it causes the relative allocations between A and B to be skewed by this +/- 3.33% error.
[00:53:30] ... which can add up to a very large effect if you were already only getting 1 or 2 slots for each banner.
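The 7/30 vs 8/30 quantization described above checks out arithmetically; a quick sketch using the slot count from the chat:

```shell
# 30 allocation slots cannot be split evenly among 4 "equal" banners:
slots=30; banners=4
base=$((slots / banners))    # every banner gets at least 7 slots
extra=$((slots % banners))   # 2 leftover slots go to 2 lucky banners
echo "$((banners - extra)) banners at $base/$slots, $extra banners at $((base + 1))/$slots"
# i.e. 23.33% vs 26.67% -- a built-in one-slot skew of 100/30 = 3.33%
```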
[00:54:11] Fortunately, AndyRussG|hmwork is about to fix this by making the allocation proportions continuous. After hmwork ;)
[00:56:06] !log disabling all queue consumers.
[00:56:15] Logged the message, Master
[00:56:21] oh wow. I blew away the entire wmf_contribution_extra table.
[00:57:01] well.
[00:57:09] fired.
[00:57:33] oof. tough day.
[00:57:43] I don't even know...
[00:57:50] That is not really recoverable...
[00:58:10] only from binary logging
[00:58:16] eek
[01:01:55] awight sorry I had to take a little break
[01:02:18] how do I get mysql back up and running? this is driving me crazy
[01:02:55] pizzzacat: I'm too busy crying at the moment
[01:03:02] for myself
[01:03:02] ha ok
[01:03:09] I'm supposed to go to a happy hour thing
[01:03:23] I think I spent 33% of my day on this issue
[01:03:57] I'm particularly offended at how it eventually worked, one of the times, and now it's borken again.
[01:05:41] awight: extra_restore still exists. not what you need?
[01:06:23] ejegg: it's about 20,000 donations old...
[01:06:30] erk
[01:06:40] (PS1) Ssmith: update gauge library and presentation [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/171486
[01:06:46] oh, same backup as dev_civicrm?
[01:06:58] yeah I know.
[01:07:04] well I gotta run, thanks for the help.
[01:07:04] yeah. I don't feel comfortable leaving a gap in the db, cos if the IDs get duplicated we'll be bailing that sewage for days.
[01:07:11] pizzzacat: hehe, good call
[01:07:19] cya pizzzacat
[01:09:19] our Jenkins dashboard is... really dark, now.
[01:10:16] this binary logging you speak of... we have it?
[01:10:27] yeah, I don't know how long the files last for, though
[01:10:52] hopefully, since our last backup at least.
[01:11:08] Sounds like this may merit pestering Jeff
[01:11:27] for sure, I hopefully SMS'd the right tel no.
[01:16:05] awight: woo interesting theory... Yeah we will fix it...
:)
[01:16:59] * awight continues to feel effusively positive about this planned work
[01:18:56] fun fun yeah :D
[01:19:25] school's out, for ever!
[01:26:35] and I'm out for the evening. Let me know if I can help with anything
[01:27:04] ejegg|away: kbye!
[02:01:23] Jeff_Green: I just can't believe... you even looked me in the eye and mentioned that it was dangerous to restore tables. and i looked tasty with ketchup.
[02:01:54] fwiw, the first thing I tried to do was load that data into a scratch database. Can we have one of those, in the future?
[02:02:46] awight: we're going to need to stop mysql on db1025 for a minute
[02:02:47] I was too lazy to dig out the pgehres password
[02:02:55] Jeff_Green: sure. most jobs are off
[02:03:13] i think we should rethink how we do bulk stuff on the prod db's
[02:03:18] yep.
[02:03:32] always always a script, preferably CR'd
[02:03:37] we could have stopped replication to db1008 to cover ass for example
[02:03:45] that would have been real nice
[02:03:46] anyway, for another day
[02:03:52] I could have stopped all the queue consumers
[02:04:07] so right now what do we need before stopping mysql on db1025?
[02:04:14] I think it's fine.
[02:04:35] I'll turn off the Civi web gui, one moment
[02:04:42] what about payments?
[02:05:14] !log CRM: drush vset maintenance_mode 1
[02:05:25] Logged the message, Master
[02:05:29] Jeff_Green: ooof you're right
[02:05:30] SPOF's
[02:05:36] awight: is dev_civicrm.wmf_contribution_extra still clean?
[02:05:38] yeah payments needs to come down
[02:05:42] also smashpigs
[02:05:53] Jeff_Green: yeah we can start with that data
[02:06:03] it's 20,000 rows behind
[02:06:24] Let me know if it's going to be too much of a headache to replay those last rows
[02:06:42] it's possible that we can recover most of it from nightly audit files
[02:07:10] ... major gifts however...
not so nice, if they were doing anything
[02:07:26] Jeff_Green: oh, not dev_civicrm
[02:07:40] Jeff_Green: start with civicrm.wmf_contribution_extra_restore
[02:07:51] dev_civicrm is 1M rows behind
[02:08:53] Jeff_Green: sorry, nvm you're right
[02:09:06] dev_civicrm was updated, it's 20k behind, now
[02:10:05] awight: so lutetium dev_civicrm.wmf_contribution_extra right? we'll start by dumping that and restoring it to db1025
[02:10:33] Jeff_Green: yep. that data is already available on db1025 though, if that's faster
[02:10:40] as civicrm.wmf_contribution_extra_restore
[02:10:45] that should be clean, still
[02:11:00] Jeff_Green: is it possible to do this without taking db1025 down, btw?
[02:11:14] payments outage is a big deal for tests, unfortunately
[02:11:26] we want to take db1025 down to cleanly copy binlogs
[02:12:02] I can copy most of them so we don't need it down for more than a few minutes
[02:14:21] Jeff_Green: cool
[02:14:36] Jeff_Green: I need to do a few minutes of baby stuff... brb
[02:35:22] awight: questions
[02:35:23] Jeff_Green: ok I'm here, let me know if I can do anything
[02:35:27] k
[02:36:40] is there any reason we should not simply rename the restored table?
[02:36:50] Jeff_Green: that would be great
[02:36:53] as a starting point
[02:36:54] if possible
[02:37:07] is springle's backup done?
[02:37:18] ok. and once we do that, will anything use or write to that table in our current operating state?
[02:37:28] no
[02:38:06] alright
[02:38:33] Jeff_Green: lemme know when you're ready to disable payments
[02:38:36] and once we do that we'll have a gap from when the dump ran to just before the table first got corrupted, that's about 20K rows?
[02:38:47] i think we're ok without stopping mysql
[02:39:03] oh wow k
[02:39:10] yes to your summary
[02:39:38] so backfilling that from binlogs will be fairly a pain in the ass
[02:40:02] I bet
[02:40:08] What can we do?
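The "simply rename the restored table" idea floated above is what eventually ships; a sketch of the statement involved (the exact invocation is a guess, but both table names appear later in the log):

```shell
# RENAME TABLE is a single atomic operation in MySQL, so clients never
# observe a moment with neither table present. Sketched, not executed:
#   mysql civicrm -e "$sql"
sql='RENAME TABLE wmf_contribution_extra_restore TO wmf_contribution_extra;'
echo "$sql"
```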
[02:40:43] basically we can get the raw sql in order, but pulling out what affected just that table in the affected timespan will be...a process.
[02:40:52] If we have to lose those entries it's probably equivalent to many hours of coverup by Donor Services...
[02:41:01] coverup?
[02:41:18] they'll probably end up answering a lot of emails
[02:41:24] ok
[02:41:37] How hard do u think this is?
[02:41:43] not sure yet
[02:41:52] I was imagining, we could go back to the 13-hr-old backup
[02:41:55] one important question--how predictable are the queries that write to the table?
[02:41:57] then replay everything.
[02:42:06] Jeff_Green: oof. Mostly deterministic.
[02:42:11] dammit.
[02:42:16] awight: is there any pattern to the statements that write to the table? say all INSERT, or all have an SQL Comment
[02:42:36] springle: we only care about insert and update...
[02:42:48] never REPLACE or DELETE?
[02:42:50] they probably do conform to a small number of patterns, if that helps
[02:43:03] no, I don't think so, but give me a minute to check.
[02:43:28] so what if we just replayed the binlogs against dev_civicrm
[02:43:33] there is a foreign key with delete cascade there
[02:43:38] Yeah I don't think we ever replace
[02:43:52] oof, there is some data which is updated by triggers, but I'm okay omitting that
[02:43:57] any other tables that will break, or be broken, by blindly replaying stuff?
[02:44:08] Jeff_Green: I'm fine with the full replay
[02:44:19] springle: if we replay on dev_civicrm it doesn't matter what happens to the other tables
[02:44:27] cool.
[02:44:29] at least, not until we're talking about foreign keys that get out of sync
[02:44:37] ok
[02:44:43] just raising questions
[02:45:08] dev_civicrm is just a scratch db people can beat on until the next time we refresh it
[02:48:06] awight what about reconstructing and/or validating restored data from logs? is that possible?
[02:49:07] Jeff_Green: from the application domain logs, like audit?
[02:49:20] yes it's possible for the online stuff, but we'll lose any Major Gifts work
[02:49:23] if any.
[02:49:44] ok
[02:49:49] something like http://paste.debian.net/130478/
[02:49:53] Jeff_Green: ^
[02:49:56] looking
[02:50:44] springle: i'll copy the binlogs over to lutetium and we can try it
[02:51:05] would need to load ignoring errors i guess. there will be PK clashes, and the multiple restores will be troublesome
[02:51:07] statements are each on one line?
[02:51:26] awight: we'll find out ;)
[02:51:49] hehehe
[02:53:27] i suppose we could skip the grep and just use --database=civicrm, since we don't care about lutetium dev_civicrm
[02:57:23] many updates seem to be of this form: /* https://civicrm.wikimedia.org/user/[REDACTED] */ UPDATE wmf_contribution_extra SET no_thank_you = '',postmark_date = null WHERE id = 11065361
[02:57:36] all having the url comment
[02:57:40] alright the binlogs are there
[02:58:51] not many inserts showing up. possibly multiline
[02:58:58] springle: yeah I would expect all the lines to have that URL comment, but fwiw that's shared by anything else coming from our Drupal/CiviCRM codebase.
[02:59:51] also, I'm sure it's painfully obvious by now, but the place to stop is when I type "drop table wmf_contribution_extra".
[02:59:56] * awight hangs head
[03:01:00] i'm gonna print you a barn star of shame :-P
[03:01:17] tattoo, pls
[03:01:32] my scarlet letter
[03:01:36] ha
[03:01:46] acquiring next target: enwiki
[03:02:10] no... dewiki
[03:02:30] their counterstrike will completely annihilate all our bases :D
[03:02:40] yeah, i think we need to avoid the grepping and just mysqlbinlog --database=civicrm
[03:03:03] too many conditions to try and meet otherwise
[03:03:25] springle: wrt your earlier comment, there should be roughly 20,000 insert statements between the last backup, and my fail
[03:03:35] donno how many updates.
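The approach in the paste being discussed (decode the binlogs to SQL, then filter down to the one damaged table) can be sketched like this; the binlog filename is hypothetical, and only the grep step is actually exercised below, on the sample UPDATE pasted in the chat:

```shell
# Hypothetical decode step (not run here; needs the real binlog files):
#   mysqlbinlog --database=civicrm mysql-bin.* > replay.sql
# CiviCRM-originated writes carry a URL comment, so single-line statements
# touching the damaged table can be pulled out with a coarse grep:
sample="/* https://civicrm.wikimedia.org/user/[REDACTED] */ UPDATE wmf_contribution_extra SET no_thank_you = '',postmark_date = null WHERE id = 11065361"
printf '%s\n' "$sample" | grep -c 'wmf_contribution_extra'
```

As the chat notes, this only catches statements that are each on one line, and inserts may be multiline, which is why the grep is later abandoned in favor of `--database=civicrm` alone.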
[03:05:22] awight looking at keys in this table...which of them could conceivably become out of sync if the binlog replay timeframe doesn't line up?
[03:06:12] I think, just id, which is not actually a problem
[03:06:21] ok
[03:07:07] * awight sucks in breath
[03:07:15] where does unique_entity_id come from?
[03:07:37] thinking more, if we are replaying all the tables, entity_id could go bad, and that would be bad
[03:08:08] If we can get max(civicrm_contribution.id) to line up, that will be good enuf
[03:08:45] in dev_civicrm, that is 8269914
[03:09:16] if you search for that in the binlog, you should find statements about insert into wmf_contribution_extra set entity_id=8269914
[03:09:20] right there, I think.
[03:09:24] where is that from though?
[03:09:38] is that an autoincrement from some other table?
[03:09:45] select max(civicrm_contribution.id) from dev_civicrm.
[03:09:51] autoincrement from that table.
[03:10:28] * awight relives my "drop table" nadir
[03:11:22] If I can look at this binlog, I might be able to help identify the step
[03:12:42] After today's performance though, I understand if u want to keep me far away from db1025 :)
[03:13:01] sec, I'll chown the files
[03:13:34] lutetium:/srv/binlogs_from_db102
[03:13:36] k
[03:14:24] Jeff_Green: do you have the plain SQL version somewhere?
[03:15:18] mysqlbinlog > out.sql
[03:15:39] I only see *bin.*
[03:15:44] oh I get it
[03:15:56] well lemme run that paste you uploaded, then
[03:15:59] for some reason rsync didn't want to set the file dates...
[03:17:00] ok i'm creating replay.sql.gz
[03:17:29] then I'll try to trim at head and tail to match the dev_civicrm state, up to my drop table statement.
[03:17:58] will be many GBs...
[03:20:12] sorry--what are the BINLOG statements?
[03:20:21] it's all base64-encoded
[03:26:08] awight: frack uses MIXED binlog format; that is, some will be plain SQL statements and some will be actual binary row copies
[03:26:40] springle: ok, interesting to see it in practice
[03:26:41] the latter should only occur for statements that are non-deterministic, and some fringe cases
[03:27:12] you should determine if the statements you need are all using plain SQL format, like the example UPDATEs i posted above
[03:27:38] hmm
[03:28:17] if not, we'll have to try replaying against the restored db and hope. the binary rows would still be applied, but it's harder to identify stuff
[03:28:18] Can you check for triggers on the wmf_contribution_extra table?
[03:28:42] there are no triggers in civicrm
[03:28:57] oh good, thx
[03:29:06] the drop table statement should be plain SQL
[03:29:11] hard to miss :)
[03:29:18] well, then I think the queries are all deterministic and are hopefully recorded as plain SQL
[03:30:27] Okay, I digested all the binlogs into replay.sql.gz
[03:31:33] the backup was somewhere shortly after line 94875100, trying to narrow that down now...
[03:37:02] it turns out it's really a long process to gunzip 300MB of cruft
[03:37:11] yes it does
[03:37:39] oh 800MB, zipped
[03:38:14] 16 cpus? ok using them now
[03:43:50] how do you want the results, should I trim the SQL concatenation, or give you the "# at NUM" lines?
[03:45:48] why not generate the SQL you need and try fixing the table on lutetium? you have the best knowledge of what the data should look like...
[03:45:54] k, I found the first cutpoint
[03:46:12] do I need special perms to run BINLOG?
[03:46:38] awight to feed it back into the db?
[03:46:53] if you can run the mysql client, just pipe the sql to it, or use "source " from mysql cli
[03:47:14] awight, before you do let's make a copy of that table
[03:47:23] mysql dev_civicrm otherwise I'll have to restore the full db if it fries
[03:47:44] gimme a minute, I'll copy it aside
[03:47:50] well, I'm planning to replay all tables fwiw
[03:48:10] orlly? ok, nm then.
[03:48:19] hmm.
[03:48:36] thinking if this goes south we're going to have an hour wait to restore the db
[03:49:23] oooooo oof
[03:49:26] awight: your 'less' is causing lutetium to swapdeath :-P
[03:49:29] hehe
[03:49:33] ok killing
[03:49:42] i can stop mysql and free up RAM if needed
[03:49:49] should be freed now
[03:49:52] k
[03:50:01] right, ok I see those alerts now
[03:50:49] i installed pigz in case you want to try a faster decompressor
[03:51:30] yes
[03:52:50] wow. faster
[03:57:45] The winning replay script is exact_replay.sql.gz
[03:58:05] I'll go ahead and try pigz -dc exact_replay.sql.gz | mysql dev_civicrm
[03:58:11] Jeff_Green: shall I?
[03:58:25] sure
[03:58:27] k
[03:58:37] ERROR 1227 (42000) at line 190: Access denied; you need (at least one of) the SUPER privilege(s) for this operation
[03:58:48] ah
[03:58:56] I can run it for you
[03:59:16] let's... tail -n +190
[03:59:38] yeah
[03:59:48] tail what?
[04:00:12] there were some insert statements in the first 189 lines, so I'm assuming those entities exist now?
[04:00:15] so,
[04:00:27] pigz -dc exact_replay.sql.gz | tail -n +190 | mysql dev_civicrm
[04:00:52] oh because it made it that far, now I follow.
[04:00:54] no!
[04:01:00] crap, some were rolled back
[04:01:03] this is horrid
[04:01:08] oh noes
[04:01:14] I haven't done anything yet jfyi
[04:02:38] Jeff_Green: ok, it's safe to play from the beginning. nothing had changed.
[04:03:42] ok.
[04:04:03] oh, heh. I just realized that this script contains my other mistake
[04:04:16] hmm.
before I do this, I'm a little concerned about the possibility of writing to the 'civicrm' db
[04:04:19] which nulled-out the wmf_contribution_extra.no_thank_you column
[04:04:24] how can we be positive that won't happen?
[04:04:32] I think we're on a different db server
[04:04:48] i know it won't kill db1025, I just mean lutetium's civicrm database
[04:04:55] oh.
[04:05:03] corrupting that buys me a replication repair shuffle
[04:05:30] there are lines like, #141105 9:00:10 server id 64040100 end_log_pos 41182335 Table_map: `civicrm`.`wmf_contribution_extra` mappe
[04:05:40] do you think those get executed via magic?
[04:05:43] I would believe it.
[04:05:44] that's what got me thinking
[04:06:12] springle? what do you think?
[04:06:25] meanwhile, I'll find my first bad query in this script
[04:07:22] I don't see it... what gives
[04:07:30] there are a bunch of use `civicrm`/*!*/;
[04:07:47] i think we may need to s/// those
[04:08:18] oh. I just realized. since I killed my original bad query, where I thought I set no_thank_you=null, it may have never committed that statement.
[04:08:28] hence I spent all day ruining things for naught
[04:09:29] Jeff_Green: if you want. sed -e 's/`civicrm`/`dev_civicrm`/'
[04:10:23] awight: what command did you end up using to generate the replay?
[04:10:27] still grepping?
[04:11:21] springle: I used your paste, but without the grep filters
[04:11:22] * Jeff_Green leaving the sed to awight
[04:11:49] Jeff_Green: btw, swapdeath is all mysqld at this point
[04:12:07] k
[04:12:18] ouch, 7G of swap for 64G of ram...
[04:12:35] no big whoop
[04:12:44] * awight hears a bus screech by
[04:12:46] activemq is 7.5GB
[04:13:29] much of mysqlbinlog output is informational. the end_log_pos stuff should be comments
[04:13:31] oh man, java 22GB
[04:13:47] jenkins is like the emacs of CI
[04:14:01] it's like the handgrenade of wristwatches
[04:14:16] xemacs -elisp | crontab -e
[04:14:36] Jeff_Green: ok I guess we're good to replay, then?
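The s/// rewrite being discussed, demonstrated on one of the offending "use" lines (single quotes keep the shell from interpreting the backticks, which was exactly the escaping worry raised above):

```shell
# Redirect every hard `civicrm` reference at the scratch db instead:
line='use `civicrm`/*!*/;'
printf '%s\n' "$line" | sed -e 's/`civicrm`/`dev_civicrm`/'
# -> use `dev_civicrm`/*!*/;
```

As the log goes on to show (the later ERROR 1146), this only fixes plain-SQL events: the base64-encoded BINLOG row events embed the db name internally and pass through sed untouched.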
[04:14:51] we still need to change the db name throughout the file
[04:15:05] I think springle was saying that shouldn't do anything
[04:15:14] they're not commented?
[04:15:25] no, i said the end_log_pos type lines shouldn't do anything
[04:15:45] the 'use ...' lines will definitely change the default db on the fly
[04:15:47] but eh
[04:15:48] pigz -dc exact_replay.sql.gz | sed -e 's/`civicrm`/`dev_civicrm`/' | mysql dev_civicrm
[04:16:02] let's fix the file itself
[04:16:06] * awight looks for use lines
[04:16:14] I don't trust shell commands with escaping and backticks
[04:16:24] heh
[04:16:40] oh god 9223425:use `civicrm`/*!*/;
[04:16:43] right ok
[04:17:52] Jeff_Green: now, the good file is exact_replay2.sql.gz
[04:18:04] should be free of `civicrm` references
[04:18:12] grand
[04:18:51] alright, any last thoughts before we launch the cows over the wall?
[04:19:27] Tartars never get smallpox.
[04:19:51] ERROR 1609 (HY000) at line 190: The BINLOG statement of type `Table_map` was not preceded by a format description BINLOG statement.
[04:19:56] noooo
[04:20:06] what does that mean
[04:20:10] that is the cow bouncing off the parapet
[04:20:18] moof
[04:20:54] http://bugs.mysql.com/bug.php?id=72804
[04:20:59] maybe
[04:21:53] no, probably not
[04:22:22] awight: Jeff_Green let me review the file. where on lutetium?
[04:22:38] /srv/binlogs_from_db1025
[04:22:53] yeah /srv/binlogs_from_db1025/exact_replay2.sql.gz
[04:23:20] the nearest Table_map is line 199
[04:23:21] I snipped after the COMMIT following a plain sql statement
[04:23:37] and there's lots of idle-seeming stuff before the offending binlog
[04:26:34] springle: would it help if I figure out the step IDs and used the mysqlbinlog tools to cut at those points, rather than doing it all textutils style?
[04:28:10] awight: using --...position arguments?
if you can
[04:28:19] ok
[04:40:57] awight: Jeff_Green: about to hammer lutetium
[04:41:05] ok
[04:41:18] eh
[04:44:58] gunzip can handle concatenated gzip files, right?
[04:45:43] good question, i dunno offhand
[04:45:58] Jeff_Green: ok here's the thing: replay3.sql.gz
[04:46:22] it's been snipped by mysqlbinlog, rather than textutils
[04:46:34] ok
[04:47:09] it's not compressed
[04:47:14] wat
[04:47:23] right
[04:47:33] replay3.sql
[04:47:54] ok, shall we enfire it?
[04:48:06] please
[04:48:20] ERROR 1146 (42S02) at line 211: Table 'civicrm.wmf_contribution_extra' doesn't exist
[04:48:24] mooo
[04:48:25] uh oh
[04:48:28] hah ok that's...
[04:48:30] that
[04:48:34] WAT
[04:49:13] oh. the BINLOG base64 cruft contains hard db name references
[04:49:18] charming
[04:49:53] yep.
[04:51:01] soooo
[04:51:11] awight: Jeff_Green: suggest all stop. let me play with lutetium for a while
[04:51:13] happily, I'm mostly certain that nothing succeeded in modifying the civicrm db, in those first lines
[04:51:21] springle: sold
[04:51:57] we should switch users to one that can't write to any of the slave dbs
[04:52:03] springle: fwiw, I did the snipping by --*position params now. results are in the file replay3.sql
[04:52:48] springle: is it helpful if I regenerate that file, so the references are consistently to `civicrm`, again?
[04:56:28] replay3-civicrm.sql.gz is my attempt at snipping, now referring to `civicrm` everywhere
[04:56:46] Let me know if I can do anything
[04:58:08] armchair observation: maybe this would run if dev_civicrm were renamed to civicrm
[05:00:05] hey guys... how's it going in the trenches?
[05:00:13] cvirtue was looking for an update a min ago.
[05:00:25] also, my sympathies.
[05:00:27] :/
[05:00:51] still down
[05:01:16] fair enough
[05:01:35] good luck...
let me know if I can do anything besides moral support
[05:01:38] we've restored the affected table from backup, and now Sean and Awight are looking at backfilling the missing rows from binary logs
[05:01:51] oh well that sounds like major progress :)
[05:02:25] Jeff_Green: I'm ready to move on without those missing rows...
[05:02:27] sorta kinda. the easy part is done anyway
[05:03:00] I can fake out the data for now, and some glorious day in the future we can repair those rows correctly.
[05:03:13] awight maybe while sean is looking at the binlog approach, is there anything we could do using logs?
[05:03:33] not really, that would mean starting the jenkins jobs up again
[05:04:05] well... most of the audit things can be run multiple times, if we have to
[05:04:13] still, it's the major gifts work I'm most loath to lose.
[05:04:25] yeah
[05:04:44] unfortunately, our audit stuff is not yet smart enough to repair existing data--it can only create new rows
[05:05:15] just to increase the heat a bit, Major Gifts has an event this Thursday, so I'm assuming they have been busy...
[05:05:44] but. lemme go ahead and make fake wmf_contribution_extra rows for the missing data
[05:07:20] isn't sean still looking at the binlogs?
[05:08:01] springle: I'm going to focus on getting our production db ready to use again, without those 20,000 missing rows. If you succeed in whatever crazy thing you're doing, I'll use that corrected data to improve my plastic surgery disaster... Sound good?
[05:09:01] Jeff_Green: we can always take the result of what he's doing, and overwrite the fake data.
[05:10:06] you'll be able to identify the fake data?
[05:10:10] Jeff_Green: yah
[05:10:34] Jeff_Green: one last thing, can you rename the _restore table in production, to wmf_contribution_extra?
[05:10:42] yeah
[05:10:50] thx.
[05:10:59] I'm worried that we left springle in the mines...
[05:11:40] table renamed
[05:11:43] k
[05:11:57] should I stop replication while you create rows?
[05:12:08] Jeff_Green: um another favor. can i have a scratch db in case I need it?
[05:12:19] i think that's the wrong approach
[05:12:22] Jeff_Green: I donno about replication. No harm, I think, since we're backed up
[05:12:31] Jeff_Green: wait what is wrong
[05:12:40] restoring to the production db like that
[05:12:49] fair
[05:13:14] I was thinking, sean is busy in dev_civicrm tho
[05:13:17] imo we should use the live slave on lutetium to generate SQL that we can check then run on the master
[05:13:38] you have a scratch db of your own on lutetium
[05:13:47] sure, good plan
[05:23:07] Jeff_Green: is there a frack box without apparmor, where I can easily start a second mysqld instance?
[05:24:01] lutetium is the best bet really
[05:24:12] there's not another db class box
[05:24:33] is apparmor blocking you there?
[05:33:41] worked around it
[05:33:46] k
[05:37:48] Jeff_Green: I've synthesized the rows in awight.wmf_contribution_extra, now gonna dump them and reimport on production.
[05:38:00] ok
[05:38:04] You want to take a look, or prefer blinders :D
[05:38:17] i can look if you want
[05:38:22] watch out for the drop table!
[05:38:31] grampa!
[05:38:37] quit driving by braille!
[05:38:57] actually
[05:39:21] let me know when you're ready to restore, and I'll pause replication until you're confident it went well
[05:39:55] sure
[05:41:37] Jeff_Green: poised to import.
[05:42:39] go. db1008 stopped
[05:42:43] going
[05:43:26] examining...
[05:44:06] Jeff_Green: looks good
[05:44:13] I'm starting Jenkins things again
[05:44:19] ya? ok. starting db1008
[05:45:39] Jeff_Green: jobs look healthy.
[05:45:47] great
[05:46:18] Jeff_Green: springle: ok, production fundraising stuff is back up, i've synthesized the missing rows for now.
[05:46:27] alright
[05:46:41] I would still like to backfill the missing data, but it's low priority, now
[05:54:24] awight you see the ty failmail right?
[05:57:53] Jeff_Green: hadn't seen it, ok thx.
[05:59:39] awight: Jeff_Green: managed to get the binlog loading. testing now [06:00:09] wowza [06:01:11] had to blow away all the foreign key constraints, and force continue when records are missing. you'll have to vet the result manually [06:02:40] nice [06:03:41] anything but nice, really [06:04:04] ha [06:04:46] but it sounds so much more grateful than "THE HORROR!" [06:08:10] it isn't going very fast, because this tiny mysqld instance is like 128M memory [06:09:52] awight: is tomorrow ok for backfilling? i can email you guys an update tonight, and leave the result on lutetium [06:10:54] springle: absolutely, thanks for that amazing end run [06:11:57] well, no chickens hatching just yet [06:12:59] ... I hatched a few batches of duck eggs in my bedroom for a few months... it can be heartbreaking [06:13:18] but when it's all over: ducks! [06:27:14] * awight cries self to sleep [06:40:03] (PS4) Awight: WIP Customized LYBUNT report [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170268 [19:33:31] Hey, team. Looks like you guys had some fun while I was out. [19:33:31] Anybody want to give me the summary? I can't find the real emails amongst the queue consumer shrapnel. :p [19:33:31] Hi K4-713 [19:33:32] Good morning/afternoon/whatever, ejegg! [19:33:32] So, we... went up and down a couple times? And late night maintenance? And... [19:33:32] The wmf_contribution_extra table got wiped out at some point, and awight, springle and Jeff_Green managed to restore it [19:33:32] * K4-713 smirks [19:33:32] Wow. [19:33:32] big fun! [19:33:32] Hey Jeff_Green! And... a very Good Aftermath to you, too. [19:33:32] the emails with the details are 'Queue consumers are paused' [19:33:32] Ah, there we go. [19:33:32] Thank you. [19:33:32] My archive finger was getting tired. [19:33:33] hey! [19:33:33] i'm not sure where we are in terms of backfilling the missing rows [19:33:33] awight seemed to indicate that some of that happened...
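Sean's actual replay used mysqlbinlog against a tiny test mysqld, with the foreign key constraints dropped so the load could continue past missing records. As a toy illustration of why that produces output needing manual vetting, here is a hedged Python simulation (the event model is invented, not MySQL's real binlog format):

```python
# Sketch: replay row events, skipping any whose parent row is missing --
# the analogue of disabling FK checks and forcing the import to continue.
# The {"id", "parent_id"} event shape is a made-up simplification.
def replay_events(events, existing_parents):
    """Apply events in order; record skipped ones for manual vetting."""
    applied, skipped = [], []
    for event in events:
        if event["parent_id"] in existing_parents:
            applied.append(event["id"])
        else:
            skipped.append(event["id"])  # would have violated a FK constraint
    return applied, skipped

applied, skipped = replay_events(
    [{"id": 1, "parent_id": 10}, {"id": 2, "parent_id": 99}],
    existing_parents={10},
)
```

The `skipped` list is the part a human has to reconcile afterwards, which is why the result landed on lutetium for review rather than going straight to production.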
[19:33:33] ...but, it's hard for me to tell at this point. [19:33:33] sean managed to play back the replication binlogs against a test instance, and that table is in a db on lutetium [19:33:33] Baller. [19:33:34] i'm not sure if adam has had a chance to check that table and import the rows back to production [19:33:34] sean mentioned some uncertainty about order in an autoincrement column [19:33:34] Makes sense. [19:33:34] Besides all that, Worldpay flipped a switch that broke everything, in an attempt to make refunds easier. [19:33:34] ha, brilliant [19:33:34] ejegg: I think they call that "solving the problem upstream" or something. [19:33:35] They flipped the switch back the other way after a day, but to get the easy refunds we have to send different order ids for each transaction in a donation attempt [19:33:35] This patch might do that: https://gerrit.wikimedia.org/r/171561 [19:33:35] ejegg: Yeah, so... different order ids in WP are not going to work without a scaryass set of changes. [19:33:35] card is here: https://wikimedia.mingle.thoughtworks.com/projects/online_fundraiser/cards/2144 [19:33:35] The only reason it works now, is that we sneakily base all the order IDs on ctid. [19:33:35] that patch doesn't actually change the order ids as far as most of the adapter is concerned - it just adds a few transaction-specific values to add a suffix to the OrderNumber for all the transactions besides the AuthoriseAndDeposit [19:33:35] So, it's ctid.attempt.someotherinteger? [19:33:35] Actually, I can tell you right now that this is definitely going to break the new audit script, too. [19:33:35] yep [19:33:35] I was hoping it wouldn't since we don't change the order ID for the real charge, just for the 10 cent auth test and the token mumbo jumbo [19:33:35] Well, it's going to use whatever actually makes it into their reconciliation files, and grep payments logs for that exact thing, so... [19:33:35] ...I think the auth test is the one that contains the donor info. 
[19:33:35] I think. [19:33:46] So the thing in their logs will likely be the real charge XML's OrderNumber (still the same order_id), which will be a substring of the auth test XML's OrderNumber. But the log prefix isn't generated from the transaction-specific XML values, it's the raw order_id from DonationData, which that patch doesn't touch. [19:33:46] I think it may work out [19:33:47] ejegg: I don't think it's using the log prefix. [19:33:47] ...for some... insane reason I can't really... remember. [19:33:47] Hmn. [19:33:49] Well, if it's using a regex with word boundaries on either end, the suffixed version should still match [19:33:49] let me take a look at that script [19:33:50] Also, I don't actually know what oid comes through in their recon files. [19:33:50] ...no way to tell up until now. [19:33:50] right [19:33:50] The good news is that all of these things should be easy to fix. [19:33:50] ...compartmentalized. [19:33:50] yah [19:34:00] ejegg: Hey, do you happen to know what caused the TY job failures in the first place? [19:34:00] You know, before the unplanned table prune. [19:34:00] Ahh, no [19:34:00] The errors were all in instantiating the phpmailer object, i think [19:34:00] Wasn't there something... insane about an amazon subscription that sent us a mutant message? [19:34:00] Yeah, that screwed up a couple queues - was that this week too? [19:34:00] Thought that was friday. whew [19:34:00] crazy times [19:34:00] Was it? I... could be making that up. [19:34:00] Adam notified me of it on Monday... I'm on a loaner laptop so can't check the message in Civi [19:34:00] but here it is if you're looking: https://civicrm.wikimedia.org/queue/donations-damaged/amazon-c9a20b96-771d-41c4-86cc-1e90402ce31c [19:34:00] yep, guess that was monday [19:34:00] can I butt in really quickly and ask if either of you know if/how I can connect to lutetium on a loaner laptop? 
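Ejegg's word-boundary point above is worth pinning down: if the audit script greps payments logs using the base order_id wrapped in `\b` boundaries, a suffixed OrderNumber still matches because the suffix begins with a non-word character (the dot), while a genuinely different id does not. A small sketch, with a hypothetical helper name:

```python
import re

def log_line_matches(order_id, log_line):
    """Match the base order id with word boundaries. A suffixed
    OrderNumber like '1234.1.2' still matches base id '1234.1',
    but '1234.10' does not. re.escape keeps the dot literal."""
    pattern = re.compile(r"\b" + re.escape(order_id) + r"\b")
    return bool(pattern.search(log_line))
```

This only holds if the recon files really carry the base id as a substring, which, as noted above, there was no way to confirm up to that point.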
[19:34:01] ooh, you'll need your ssh keys [19:34:03] blegh [19:34:03] ccogdill_: How long are you going to be suffering with the loaner? [19:34:03] And: Are you going to get the old one back? [19:34:03] no idea, but Chip will be looking at my laptop after metrics [19:34:03] so maybe! [19:34:03] ah, okay. [19:34:03] Yosemite killed it :/ [19:34:03] Oops. [19:34:03] Well, shoot. How important is it to get back in before we know if the old machine is going to work again? [19:34:03] I can wait till after metrics, it's just not ideal [19:34:03] if I have my lutetium public key is that enough? or I need the private one too? [19:34:03] hmm I guess that sounds obvious now that I ask [19:34:03] :D [19:34:03] :p [19:34:03] I'm guessing that you didn't back up the private key... [19:34:03] it appears not... [19:34:04] I'll just cross my fingers Chip can work his magic [19:34:04] ...so, if we can't retrieve it, you'll have to get new keys. [19:34:04] yeah [19:34:04] maybe [19:34:05] we'll see [19:34:05] And, I mean, nobody should have extra copies of private keys wandering around, so don't take this as a lesson to copy it all over the place. [19:34:05] The only place I have mine backed up is on a fully encrypted USB stick that nukes itself if you get the password wrong too many times. :) [19:34:05] * K4-713 makes the serious face [19:34:05] haha where can I get one of those?! [19:34:05] then it bites your hand [19:34:05] atgo: it says I'm not allowed to join the video call :( [19:34:06] ccogdill_: Annoyingly, thinkgeek no longer carries them. [19:34:06] AndyRussG i sent you an invite [19:34:06] ? [19:34:06] what? you'd think that would only increase in demand [19:34:06] weird [19:34:06] ccogdill_: It's close to this, though: http://www.ironkey.com/en-US/encrypted-storage-drives/250-personal.html [19:34:06] okay maybe I'll get one of those after this is all figured out [19:34:07] They are only slightly crazy-expensive. [19:34:07] But, you know: Safety first.
[19:34:07] Okay, heading upstairs for metrics. [19:34:27] (PS1) Ejegg: Send different OrderNumbers for each WP transaction [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/171561 [19:34:28] (CR) Ejegg: "Gonna hold off CR till the bower_modules submodule and bower.json updates are ready" [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/171486 (owner: Ssmith) [19:40:19] hey ejegg - just a quadruple check that the cookie changes from earlier this summer are OK [19:40:27] i know you set the date to start sometime around now [19:40:37] or expire or something [20:05:41] atgo: They should be! [20:05:51] cool, thanks! :) [22:34:05] Is there standup todaaaaaay? [22:34:20] for a year and a day [22:35:24] awight: boo! [22:36:11] AndyRussG: hey! [22:36:35] :) [22:36:54] I'm alone... in the infinite universe... of a hangout... [22:38:12] The icon for turning the web cam on or off in hangouts is so passé... [22:38:21] hehe [22:38:38] sorry, we had an emergency long team lunch [22:38:48] I hear comrades slowly filling in around me [22:38:48] ah np [22:38:50] but not quite [22:39:02] a tasty emergency then [22:42:42] so... standup? no standup? [22:42:57] it doesn't look like it [22:43:07] the voices faded away [22:43:28] heheh OK [22:43:31] I'm... not officially working, just poking at the LYBUNT nonsense a bit while my kid's asleep [22:43:43] yesterday's meltdown was ridiculous [22:44:00] I guess I didn't learn much about it even... [22:44:38] I just got access to the FR cluster, BTW... should I be doing something to help? [22:45:01] For the 'do not solicit' ticket, is there any expectation that the field appear with the 'do not email' and 'do not phone' fields, or is a regular custom field in its own group fine? [22:45:06] Hopefully Jeff and Sean forget and forgive [22:46:04] And Phil, Jessica and Laura too? [22:46:35] they are history [22:46:46] ? [22:46:52] I was afraid to remind them by actually pinging full names :) [22:47:07] it was terrible.
I thought I'd damaged a column in the CiviCRM database... [22:47:16] uuuf [22:47:18] went down this long long road to try to repair this damage [22:47:43] by the time I'd accidentally dropped the entire table and had to turn off all our jobs, I realized I hadn't actually been successful in the original damage [22:48:07] ouch [22:48:23] yah 15 hours later... [22:49:03] stealth apparent damage causing greater damage... [22:49:10] did it get fixed? [22:56:20] hey ejegg AndyRussG awight pizzzacat1... sorry for the sloppiness around standups lately. could we shoot for 1:03 daily going forward? [22:58:55] atgo: np, 1:03 is fine for me... [23:01:57] ok, I like that time too! [23:08:44] Yeah that'd help me be more flexible with homework helping time here... speaking of which... [23:41:02] atgo I am definitely ok with 1:03 [23:41:10] ok.. changed it :) [23:41:13] late response I know heh [23:41:23] was troubleshooting database issues with awight [23:55:00] ppena who should be on it?