[00:02:14] homebrew
[00:02:15] (CR) Jforrester: [C: 2] Add i18n for API module help [extensions/ContributionTracking] - https://gerrit.wikimedia.org/r/169811 (owner: Anomie)
[00:02:29] (Merged) jenkins-bot: Add i18n for API module help [extensions/ContributionTracking] - https://gerrit.wikimedia.org/r/169811 (owner: Anomie)
[00:02:32] ok found my.cnf, and it's empty
[00:03:07] ok, just add these two lines (adjust paths as needed)
[00:03:31] general_log_file = /var/log/mysql/mysql.log
[00:03:31] general_log = 1
[00:03:52] then restart mysql
[00:13:00] so I keep having this error
[00:13:08] when I close mysql then restart
[00:13:22] it doesn't update "the PID file"
[00:13:36] which is something awight and I were troubleshooting
[00:13:40] hmm. permissions?
[00:13:53] permissions with?
[00:14:01] do you do things with sudo on mac?
[00:14:10] I do
[00:14:25] so do the permissions change when you shut down mysql?
[00:15:14] I used to just be able to go mysql.server start
[00:15:16] just thinking that trying to start it again without sudo would have permissions issues on the pid file
[00:15:17] and everything worked
[00:15:21] but now I get this error
[00:15:31] ah ok, I'll try that
[00:15:59] hmm nope
[00:16:10] huh. if you put # signs in front of those two lines in my.cnf, does it start up?
[00:18:03] how are you shutting down mysql?
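For reference, the two lines suggested above need to live under an option group to parse; a minimal sketch of the resulting my.cnf (the `[mysqld]` group header is an assumption not stated in the chat, and a Homebrew install typically keeps my.cnf under /usr/local/etc rather than /etc):

```ini
# Minimal my.cnf enabling the MySQL general query log, per the chat above.
# NOTE: the [mysqld] group header is assumed; option lines placed in an
# empty file with no group cause a "Found option without preceding group"
# startup error, which could itself explain a failure to start.
[mysqld]
general_log      = 1
general_log_file = /var/log/mysql/mysql.log
```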
[00:18:06] the specific PID file thing is probably caused by the lack of a pidfile path in mysqld.conf
[00:18:25] nope :(
[00:18:26] for now, shutdown is like "killall -9 safe_mysqld" :)
[00:18:58] awight when I do that it says no matching processes were found
[00:19:04] last time, we had to ps auxxww|grep mysqld and snipe the processes manually
[00:19:05] and when I get the individual process,
[00:19:11] lemme see what the exact names are
[00:19:13] I get the process id and kill -9
[00:19:21] and it says process not found
[00:19:26] then I try to look again
[00:19:28] hrm
[00:19:31] and the id has changed
[00:19:31] killall -9 mysqld_safe ; killall -9 mysqld
[00:19:34] werry odd
[00:19:36] yep.
[00:19:40] it changes every time hahah
[00:20:03] the parent respawns its child, so you have to kill that first. pleasant unix 201 exercise
[00:20:15] Actually, I think unix is graded in circles of hell
[00:20:22] purgatory7
[00:20:30] yeah awight both of those produce the same result
[00:20:37] which is that it's saying those processes are not found
[00:20:43] with and without sudo
[00:20:54] how do I kill the parent?
[00:21:08] in that case, run the "ps auxxww|grep mysqld" command
[00:21:10] does mysql.server stop or mysql.server restart do anything?
[00:21:26] find the mysqld_safe process id and kill that first. then kill the mysqld process
[00:21:36] ejegg: nah, they require the PID to do their job
[00:21:48] how do I find the mysqld_safe process id?
[00:21:51] I guess, since the mysqld.conf file is empty, there is no pid
[00:21:55] those suggestions didn't work
[00:21:56] pizzzacat: you run the ps...grep
[00:22:01] I did that
[00:22:03] what's the output?
[00:22:18] ps auxxww|grep /usr/local/Cellar/mysql/5.6.17_1/bin/mysqld
[00:22:19] sherahsmith 24713 0.0 0.0 2423368 188 s009 R+ 4:21PM 0:00.00 grep --color=auto /usr/local/Cellar/mysql/5.6.17_1/bin/mysqld
[00:22:26] it worked, then!
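The parent-first kill sequence discussed above can be sketched as shell. Only the PID extraction is actually exercised here, on a canned ps line built from the chat (the kill commands themselves are shown as comments since there is no live mysqld to kill):

```shell
# mysqld_safe (the parent) respawns mysqld, so the parent must die first:
#   kill -9 "$(pgrep -f mysqld_safe)"
#   kill -9 "$(pgrep -x mysqld)"
# Extracting the PID (column 2) from ps output, demonstrated on a canned line:
sample='sherahsmith 24713 0.0 0.0 2423368 188 s009 R+ 4:21PM 0:00.00 /usr/local/Cellar/mysql/5.6.17_1/bin/mysqld'
pid=$(printf '%s\n' "$sample" | awk '{print $2}')
echo "pid to kill: $pid"
```

Note that the ps output pasted in the chat is only the grep process matching its own command line, which is why "no matching processes were found": `grep '[m]ysqld'` or `pgrep` avoids that false positive.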
[00:22:41] now, apply 12V to
[00:22:48] mysql.service start
[00:22:48] mysql.server start
[00:22:48] Starting MySQL
[00:22:48] . ERROR! The server quit without updating PID file (/usr/local/var/mysql/sherah-smith-2.corp.wikimedia.org.pid).
[00:22:54] meh
[00:23:15] I'm going to killall
[00:23:22] * awight ducks under the desk
[00:23:24] -9 manhattans
[00:23:31] you already did kill some more
[00:23:49] unless ps shows that jumpstarting the service worked?
[00:24:00] * awight grabs some more popcorn
[00:24:04] en ingles por favor
[00:24:29] imagine that you only have 4 tools that work
[00:24:37] generous
[00:24:50] yes this is no Gilligan's iland shite
[00:25:02] what do you mean "ps shows that jumpstarting the service worked"?
[00:25:12] searchlight: ps auxxww|grep mysqld
[00:25:22] hammer: kill -9 NUM
[00:25:41] sherahsmith 24830 0.0 0.0 2423368 188 s009 R+ 4:25PM 0:00.00 grep --color=auto mysqld
[00:25:41] battery wires attached to each arm: mysql.server start
[00:25:53] that should totally be enuf to get your server going again
[00:26:10] move that mysqld.conf file to another name, for now
[00:26:13] um but that's what I'm saying doesn't work
[00:26:24] ok I'm confused
[00:26:51] let's back up to "ps shows that jumpstarting the service worked"
[00:27:01] and only use commands
[00:27:07] so I know exactly what to type
[00:27:09] your ps shows that there's no server running
[00:27:12] goot
[00:27:13] because sometimes I can't tell what's commentary
[00:27:18] and what's commands
[00:27:25] sorry :(
[00:27:27] then u try to start the mysql.server. fail?
[00:27:46] umm
[00:27:46] perhaps it's failing because your conf file is bad. try renaming that file.
[00:27:52] wait!!
wait wait
[00:27:59] hehe sorry, it's hard to type as I eat this popcorn
[00:28:02] I can't tell what you're saying awight
[00:28:15] so here's where I'm at
[00:28:18] k
[00:28:30] I got the error
[00:28:41] saying the server quit without updating the PID file
[00:28:45] then I typed ps auxxww|grep mysqld
[00:28:56] and no process was running...
[00:28:58] then I got sherahsmith 24830 0.0 0.0 2423368 188 s009 R+ 4:25PM 0:00.00 grep --color=auto mysqld
[00:29:06] so what is my next step?
[00:29:07] cool.
[00:29:17] try renaming the config file
[00:29:39] ok which config file?
[00:31:03] awight ^
[00:31:19] I don't think u told us the exact path, but somewhere under /usr/local/Cellar/mysql/5.6.17_1
[00:31:46] oh oh ok
[00:31:59] rename it just to get it hidden?
[00:32:02] or to something specific?
[00:32:24] yeah just to hide it
[00:32:33] then, try starting the server again
[00:32:45] ok!
[00:32:56] nope same error
[00:33:10] blahhh mysql
[00:33:46] and ps... says no server is running?
[00:33:51] pizzzacat1: ^
[00:35:15] while you're having fun, try tail /usr/local/var/mysql/Sherah-Smith-2.local.err
[00:35:38] and even, ll /usr/local/var/mysql/Sherah-Smith-2.local.err
[00:37:24] ps says sherahsmith 25042 0.0 0.0 2432784 516 s009 S+ 4:37PM 0:00.00 grep --color=auto mysqld awight
[00:38:05] ewulczyn: I wish we had the graphix from that notebook you emailed
[00:38:58] awight: do you want the plots separately?
[00:39:01] ewulczyn: however, there's this nasty aliasing effect we are fixing soon, where allocations only come in increments of 100%/30
[00:39:14] meh whatever, if it happens at a round hour, I'm pretty sure I know what's happening
[00:41:33] My guess is that another campaign becomes active, and it causes the allocations to change slightly. Because of the 3.3% rounding effect, sometimes a banner will be robbed of some likelihood. you'd have to look at the allocations during that test, and see whether it's true, though.
[00:41:38] ewulczyn: ^
[00:41:51] pizzzacat: anything in the error logs?
[00:42:26] pizzzacat: also, I would suggest running mysql.service with its -v or --debug options, but I don't see documentation about that, anywhere. Where does this janky brew package come from?
[00:43:52] ewulczyn: hmm, but that's much more than a 3.3% difference...
[00:44:22] aargh, apparently you can't add order_id to staged_vars without some nasty side effects. Thank goodness for these tests.
[00:44:26] ah no it still might be the case, because the worst case situation is actually 3.3% vs 6.6%, between a/b
[00:44:32] ewulczyn: nooo
[00:44:34] sorry
[00:44:38] ejegg: noooo!
[00:44:59] srsly...
[00:45:13] NOOO
[00:45:22] order_id is totally off in its own evil bubble
[00:45:32] maybe.... good magic
[00:45:35] but definitely unique magic
[00:47:36] ewulczyn: I don't understand why your analysis is blocked, though--can't you just normalize out # of impressions?
[00:48:00] it's the "combine" you can't do?
[00:49:03] well, I can override normalizeOrderID and regenerateOrderID to set those transaction-specific values after calling parent. Blecch.
[00:49:17] awight: the analysis is not blocked. The last comment was directed at Megan to warn her against combining the data. I am confused about how the campaigns were interacting in the way shown
[00:51:41] ewulczyn: it's terrible, but imagine you have four banners equally sharing allocations.
[00:51:59] Because we quantize to 30 allocation "slots", you cannot have four exactly equal banners.
[00:52:27] Instead, two will be at 7/30 and two at 8/30 of the allocation
[00:53:05] I think that's what happens at day 3 of your test--some other campaign starts to overlap with this test, and it causes the relative allocations between A and B to be skewed by this +/- 3.33% error.
[00:53:30] ... which can add up to a very large effect if you were already only getting 1 or 2 slots for each banner.
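The 7/30 vs 8/30 quantization described above checks out arithmetically; a quick sketch using the slot count from the chat:

```shell
# 30 allocation slots cannot be split evenly among 4 "equal" banners:
slots=30; banners=4
base=$((slots / banners))    # every banner gets at least 7 slots
extra=$((slots % banners))   # 2 leftover slots go to 2 lucky banners
echo "$((banners - extra)) banners at $base/$slots, $extra banners at $((base + 1))/$slots"
# i.e. 23.33% vs 26.67% -- a built-in one-slot skew of 100/30 = 3.33%
```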
[00:54:11] Fortunately, AndyRussG|hmwork is about to fix this by making the allocation proportions continuous. After hmwork ;)
[00:56:06] !log disabling all queue consumers.
[00:56:15] Logged the message, Master
[00:56:21] oh wow. I blew away the entire wmf_contribution_extra table.
[00:57:01] well.
[00:57:09] fired.
[00:57:33] oof. tough day.
[00:57:43] I don't even know...
[00:57:50] That is not really recoverable...
[00:58:10] only from binary logging
[00:58:16] eek
[01:01:55] awight sorry I had to take a little break
[01:02:18] how do I get mysql back up and running? this is driving me crazy
[01:02:55] pizzzacat: I'm too busy crying at the moment
[01:03:02] for myself
[01:03:02] ha ok
[01:03:09] I'm supposed to go to a happy hour thing
[01:03:23] I think I spent 33% of my day on this issue
[01:03:57] I'm particularly offended at how it eventually worked, one of the times, and now it's borken again.
[01:05:41] awight: extra_restore still exists. not what you need?
[01:06:23] ejegg: it's about 20,000 donations old...
[01:06:30] erk
[01:06:40] (PS1) Ssmith: update gauge library and presentation [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/171486
[01:06:46] oh, same backup as dev_civicrm?
[01:06:58] yeah I know.
[01:07:04] well I gotta run, thanks for the help.
[01:07:04] yeah. I don't feel comfortable leaving a gap in the db, cos if the IDs get duplicated we'll be bailing that sewage for days.
[01:07:11] pizzzacat: hehe, good call
[01:07:19] cya pizzzacat
[01:09:19] our Jenkins dashboard is... really dark, now.
[01:10:16] this binary logging you speak of... we have it?
[01:10:27] yeah, I don't know how long the files last for, though
[01:10:52] hopefully, since our last backup at least.
[01:11:08] Sounds like this may merit pestering Jeff
[01:11:27] for sure, I hopefully SMS'd the right tel no.
[01:16:05] awight: woo interesting theory... Yeah we will fix it...
:)
[01:16:59] * awight continues to feel effusively positive about this planned work
[01:18:56] fun fun yeah :D
[01:19:25] school's out, for ever!
[01:26:35] and I'm out for the evening. Let me know if I can help with anything
[01:27:04] ejegg|away: kbye!
[02:01:23] Jeff_Green: I just can't believe... you even looked me in the eye and mentioned that it was dangerous to restore tables. and i looked tasty with ketchup.
[02:01:54] fwiw, the first thing I tried to do was load that data into a scratch database. Can we have one of those, in the future?
[02:02:46] awight: we're going to need to stop mysql on db1025 for a minute
[02:02:47] I was too lazy to dig out the pgehres password
[02:02:55] Jeff_Green: sure. most jobs are off
[02:03:13] i think we should rethink how we do bulk stuff on the prod db's
[02:03:18] yep.
[02:03:32] always always a script, preferably CR'd
[02:03:37] we could have stopped replication to db1008 to cover ass for example
[02:03:45] that would have been real nice
[02:03:46] anyway, for another day
[02:03:52] I could have stopped all the queue consumers
[02:04:07] so right now what do we need before stopping mysql on db1025?
[02:04:14] I think it's fine.
[02:04:35] I'll turn off the Civi web gui, one moment
[02:04:42] what about payments?
[02:05:14] !log CRM: drush vset maintenance_mode 1
[02:05:25] Logged the message, Master
[02:05:29] Jeff_Green: ooof you're right
[02:05:30] SPOF's
[02:05:36] awight: is dev_civicrm.wmf_contribution_extra still clean?
[02:05:38] yeah payments needs to come down
[02:05:42] also smashpigs
[02:05:53] Jeff_Green: yeah we can start with that data
[02:06:03] it's 20,000 rows behind
[02:06:24] Let me know if it's going to be too much of a headache to replay those last rows
[02:06:42] it's possible that we can recover most of it from nightly audit files
[02:07:10] ... major gifts however...
not so nice, if they were doing anything
[02:07:26] Jeff_Green: oh, not dev_civicrm
[02:07:40] Jeff_Green: start with civicrm.wmf_contribution_extra_restore
[02:07:51] dev_civicrm is 1M rows behind
[02:08:53] Jeff_Green: sorry, nvm you're right
[02:09:06] dev_civicrm was updated, it's 20k behind, now
[02:10:05] awight: so lutetium dev_civicrm.wmf_contribution_extra right? we'll start by dumping that and restoring it to db1025
[02:10:33] Jeff_Green: yep. that data is already available on db1025 though, if that's faster
[02:10:40] as civicrm.wmf_contribution_extra_restore
[02:10:45] that should be clean, still
[02:11:00] Jeff_Green: is it possible to do this without taking db1025 down, btw?
[02:11:14] payments outage is a big deal for tests, unfortunately
[02:11:26] we want to take db1025 down to cleanly copy binlogs
[02:12:02] I can copy most of them so we don't need it down for more than a few minutes
[02:14:21] Jeff_Green: cool
[02:14:36] Jeff_Green: I need to do a few minutes of baby stuff... brb
[02:35:22] awight: questions
[02:35:23] Jeff_Green: ok I'm here, let me know if I can do anything
[02:35:27] k
[02:36:40] is there any reason we should not simply rename the restored table?
[02:36:50] Jeff_Green: that would be great
[02:36:53] as a starting point
[02:36:54] if possible
[02:37:07] is springle's backup done?
[02:37:18] ok. and once we do that, will anything use or write to that table in our current operating state?
[02:37:28] no
[02:38:06] alright
[02:38:33] Jeff_Green: lemme know when you're ready to disable payments
[02:38:36] and once we do that we'll have a gap from when the dump ran to just before the table first got corrupted, that's about 20K rows?
[02:38:47] i think we're ok without stopping mysql
[02:39:03] oh wow k
[02:39:10] yes to your summary
[02:39:38] so backfilling that from binlogs will be fairly a pain in the ass
[02:40:02] I bet
[02:40:08] What can we do?
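The "simply rename the restored table" idea floated above is what eventually ships; a sketch of the statement involved (the exact invocation is a guess, but both table names appear later in the log):

```shell
# RENAME TABLE is a single atomic operation in MySQL, so clients never
# observe a moment with neither table present. Sketched, not executed:
#   mysql civicrm -e "$sql"
sql='RENAME TABLE wmf_contribution_extra_restore TO wmf_contribution_extra;'
echo "$sql"
```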
[02:40:43] basically we can get the raw sql in order, but pulling out what affected just that table in the affected timespan will be...a process.
[02:40:52] If we have to lose those entries it's probably equivalent to many hours of coverup by Donor Services...
[02:41:01] coverup?
[02:41:18] they'll probably end up answering a lot of emails
[02:41:24] ok
[02:41:37] How hard do u think this is?
[02:41:43] not sure yet
[02:41:52] I was imagining, we could go back to the 13-hr-old backup
[02:41:55] one important question--how predictable are the queries that write to the table?
[02:41:57] then replay everything.
[02:42:06] Jeff_Green: oof. Mostly deterministic.
[02:42:11] dammit.
[02:42:16] awight: is there any pattern to the statements that write to the table? say all INSERT, or all have an SQL Comment
[02:42:36] springle: we only care about insert and update...
[02:42:48] never REPLACE or DELETE?
[02:42:50] they probably do conform to a small number of patterns, if that helps
[02:43:03] no, I don't think so, but give me a minute to check.
[02:43:28] so what if we just replayed the binlogs against dev_civicrm
[02:43:33] there is a foreign key with delete cascade there
[02:43:38] Yeah I don't think we ever replace
[02:43:52] oof, there is some data which is updated by triggers, but I'm okay omitting that
[02:43:57] any other tables that will break, or be broken, by blindly replaying stuff?
[02:44:08] Jeff_Green: I'm fine with the full replay
[02:44:19] springle: if we replay on dev_civicrm it doesn't matter what happens to the other tables
[02:44:27] cool.
[02:44:29] at least, not until we're talking about foreign keys that get out of sync
[02:44:37] ok
[02:44:43] just raising questions
[02:45:08] dev_civicrm is just a scratch db people can beat on until the next time we refresh it
[02:48:06] awight what about reconstructing and/or validating restored data from logs? is that possible?
[02:49:07] Jeff_Green: from the application domain logs, like audit?
[02:49:20] yes it's possible for the online stuff, but we'll lose any Major Gifts work
[02:49:23] if any.
[02:49:44] ok
[02:49:49] something like http://paste.debian.net/130478/
[02:49:53] Jeff_Green: ^
[02:49:56] looking
[02:50:44] springle: i'll copy the binlogs over to lutetium and we can try it
[02:51:05] would need to load ignoring errors i guess. there will be PK clashes, and the multiple restores will be troublesome
[02:51:07] statements are each on one line?
[02:51:26] awight: we'll find out ;)
[02:51:49] hehehe
[02:53:27] i suppose we could skip the grep and just use --database=civicrm, since we don't care about lutetium dev_civicrm
[02:57:23] many updates seem to be of this form: /* https://civicrm.wikimedia.org/user/[REDACTED] */ UPDATE wmf_contribution_extra SET no_thank_you = '',postmark_date = null WHERE id = 11065361
[02:57:36] all having the url comment
[02:57:40] alright the binlogs are there
[02:58:51] not many inserts showing up. possibly multiline
[02:58:58] springle: yeah I would expect all the lines to have that URL comment, but fwiw that's shared by anything else coming from our Drupal/CiviCRM codebase.
[02:59:51] also, I'm sure it's painfully obvious by now, but the place to stop is when I type "drop table wmf_contribution_extra".
[02:59:56] * awight hangs head
[03:01:00] i'm gonna print you a barn star of shame :-P
[03:01:17] tattoo, pls
[03:01:32] my scarlet letter
[03:01:36] ha
[03:01:46] acquiring next target: enwiki
[03:02:10] no... dewiki
[03:02:30] their counterstrike will completely annihilate all our bases :D
[03:02:40] yeah, i think we need to avoid the grepping and just mysqlbinlog --database=civicrm
[03:03:03] too many conditions to try and meet otherwise
[03:03:25] springle: wrt your earlier comment, there should be roughly 20,000 insert statements between the last backup, and my fail
[03:03:35] donno how many updates.
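The approach in the paste being discussed (decode the binlogs to SQL, then filter down to the one damaged table) can be sketched like this; the binlog filename is hypothetical, and only the grep step is actually exercised below, on the sample UPDATE pasted in the chat:

```shell
# Hypothetical decode step (not run here; needs the real binlog files):
#   mysqlbinlog --database=civicrm mysql-bin.* > replay.sql
# CiviCRM-originated writes carry a URL comment, so single-line statements
# touching the damaged table can be pulled out with a coarse grep:
sample="/* https://civicrm.wikimedia.org/user/[REDACTED] */ UPDATE wmf_contribution_extra SET no_thank_you = '',postmark_date = null WHERE id = 11065361"
printf '%s\n' "$sample" | grep -c 'wmf_contribution_extra'
```

As the chat notes, this only catches statements that are each on one line, and inserts may be multiline, which is why the grep is later abandoned in favor of `--database=civicrm` alone.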
[03:05:22] awight looking at keys in this table...which of them could conceivably become out of sync if the binlog replay timeframe doesn't line up?
[03:06:12] I think, just id, which is not actually a problem
[03:06:21] ok
[03:07:07] * awight sucks in breath
[03:07:15] where does unique_entity_id come from?
[03:07:37] thinking more, if we are replaying all the tables, entity_id could go bad, and that would be bad
[03:08:08] If we can get max(civicrm_contribution.id) to line up, that will be good enuf
[03:08:45] in dev_civicrm, that is 8269914
[03:09:16] if you search for that in the binlog, you should find statements about insert into wmf_contribution_extra set entity_id=8269914
[03:09:20] right there, I think.
[03:09:24] where is that from though?
[03:09:38] is that an autoincrement from some other table?
[03:09:45] select max(civicrm_contribution.id) from dev_civicrm.
[03:09:51] autoincrement from that table.
[03:10:28] * awight relives my "drop table" nadir
[03:11:22] If I can look at this binlog, I might be able to help identify the step
[03:12:42] After today's performance though, I understand if u want to keep me far away from db1025 :)
[03:13:01] sec, I'll chown the files
[03:13:34] lutetium:/srv/binlogs_from_db102
[03:13:36] k
[03:14:24] Jeff_Green: do you have the plain SQL version somewhere?
[03:15:18] mysqlbinlog > out.sql
[03:15:39] I only see *bin.*
[03:15:44] oh I get it
[03:15:56] well lemme run that paste you uploaded, then
[03:15:59] for some reason rsync didn't want to set the file dates...
[03:17:00] ok i'm creating replay.sql.gz
[03:17:29] then I'll try to trim at head and tail to match the dev_civicrm state, up to my drop table statement.
[03:17:58] will be many GBs...
[03:20:12] sorry--what are the BINLOG statements?
[03:20:21] it's all base64-encoded
[03:26:08] awight: frack uses MIXED binlog format; that is, some will be plain SQL statements and some will be actual binary row copies
[03:26:40] springle: ok, interesting to see it in practice
[03:26:41] the latter should only occur for statements that are non-deterministic, and some fringe cases
[03:27:12] you should determine if the statements you need are all using plain SQL format, like the example UPDATEs i posted above
[03:27:38] hmm
[03:28:17] if not, we'll have to try replaying against the restored db and hope. the binary rows would still be applied, but it's harder to identify stuff
[03:28:18] Can you check for triggers on the wmf_contribution_extra table?
[03:28:42] there are no triggers in civicrm
[03:28:57] oh good, thx
[03:29:06] the drop table statement should be plain SQL
[03:29:11] hard to miss :)
[03:29:18] well, then I think the queries are all deterministic and are hopefully recorded as plain SQL
[03:30:27] Okay, I digested all the binlogs into replay.sql.gz
[03:31:33] the backup was somewhere shortly after line 94875100, trying to narrow that down now...
[03:37:02] it turns out it's really a long process to gunzip 300MB of cruft
[03:37:11] yes it does
[03:37:39] oh 800MB, zipped
[03:38:14] 16 cpus? ok using them now
[03:43:50] how do you want the results, should I trim the SQL concatenation, or give you the "# at NUM" lines?
[03:45:48] why not generate the SQL you need and try fixing the table on lutetium? you have the best knowledge of what the data should look like...
[03:45:54] k, I found the first cutpoint
[03:46:12] do I need special perms to run BINLOG?
[03:46:38] awight to feed it back into the db?
[03:46:53] if you can run the mysql client, just pipe the sql to it, or use "source " from mysql cli
[03:47:14] awight, before you do let's make a copy of that table
[03:47:23] mysql dev_civicrm otherwise I'll have to restore the full db if it fries
[03:47:44] gimme a minute, I'll copy it aside
[03:47:50] well, I'm planning to replay all tables fwiw
[03:48:10] orlly? ok, nm then.
[03:48:19] hmm.
[03:48:36] thinking if this goes south we're going to have an hour wait to restore the db
[03:49:23] oooooo oof
[03:49:26] awight: your 'less' is causing lutetium to swapdeath :-P
[03:49:29] hehe
[03:49:33] ok killing
[03:49:42] i can stop mysql and free up RAM if needed
[03:49:49] should be freed now
[03:49:52] k
[03:50:01] right, ok I see those alerts now
[03:50:49] i installed pigz in case you want to try a faster decompressor
[03:51:30] yes
[03:52:50] wow. faster
[03:57:45] The winning replay script is exact_replay.sql.gz
[03:58:05] I'll go ahead and try pigz -dc exact_replay.sql.gz | mysql dev_civicrm
[03:58:11] Jeff_Green: shall I?
[03:58:25] sure
[03:58:27] k
[03:58:37] ERROR 1227 (42000) at line 190: Access denied; you need (at least one of) the SUPER privilege(s) for this operation
[03:58:48] ah
[03:58:56] I can run it for you
[03:59:16] let's... tail -n +190
[03:59:38] yeah
[03:59:48] tail what?
[04:00:12] there were some insert statements in the first 189 lines, so I'm assuming those entities exist now?
[04:00:15] so,
[04:00:27] pigz -dc exact_replay.sql.gz | tail -n +190 | mysql dev_civicrm
[04:00:52] oh because it made it that far, now I follow.
[04:00:54] no!
[04:01:00] crap, some were rolled back
[04:01:03] this is horrid
[04:01:08] oh noes
[04:01:14] I haven't done anything yet jfyi
[04:02:38] Jeff_Green: ok, it's safe to play from the beginning. nothing had changed.
[04:03:42] ok.
[04:04:03] oh, heh. I just realized that this script contains my other mistake
[04:04:16] hmm.
before I do this, I'm a little concerned about the possibility of writing to the 'civicrm' db
[04:04:19] which nulled-out the wmf_contribution_extra.no_thank_you column
[04:04:24] how can we be positive that won't happen?
[04:04:32] I think we're on a different db server
[04:04:48] i know it won't kill db1025, I just mean lutetium's civicrm database
[04:04:55] oh.
[04:05:03] corrupting that buys me a replication repair shuffle
[04:05:30] there are lines like, #141105 9:00:10 server id 64040100 end_log_pos 41182335 Table_map: `civicrm`.`wmf_contribution_extra` mappe
[04:05:40] do you think those get executed via magic?
[04:05:43] I would believe it.
[04:05:44] that's what got me thinking
[04:06:12] springle? what do you think?
[04:06:25] meanwhile, I'll find my first bad query in this script
[04:07:22] I don't see it... what gives
[04:07:30] there are a bunch of use `civicrm`/*!*/;
[04:07:47] i think we may need to s/// those
[04:08:18] oh. I just realized. since I killed my original bad query, where I thought I set no_thank_you=null, it may have never committed that statement.
[04:08:28] hence I spent all day ruining things for naught
[04:09:29] Jeff_Green: if you want. sed -e 's/`civicrm`/`dev_civicrm`/'
[04:10:23] awight: what command did you end up using to generate the replay?
[04:10:27] still grepping?
[04:11:21] springle: I used your paste, but without the grep filters
[04:11:22] * Jeff_Green leaving the sed to awight
[04:11:49] Jeff_Green: btw, swapdeath is all mysqld at this point
[04:12:07] k
[04:12:18] ouch, 7G of swap for 64G of ram...
[04:12:35] no big whoop
[04:12:44] * awight hears a bus screech by
[04:12:46] activemq is 7.5GB
[04:13:29] much of mysqlbinlog output is informational. the end_log_pos stuff should be comments
[04:13:31] oh man, java 22GB
[04:13:47] jenkins is like the emacs of CI
[04:14:01] it's like the handgrenade of wristwatches
[04:14:16] xemacs -elisp | crontab -e
[04:14:36] Jeff_Green: ok I guess we're good to replay, then?
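The s/// rewrite being discussed, demonstrated on one of the offending "use" lines (single quotes keep the shell from interpreting the backticks, which was exactly the escaping worry raised above):

```shell
# Redirect every hard `civicrm` reference at the scratch db instead:
line='use `civicrm`/*!*/;'
printf '%s\n' "$line" | sed -e 's/`civicrm`/`dev_civicrm`/'
# -> use `dev_civicrm`/*!*/;
```

As the log goes on to show (the later ERROR 1146), this only fixes plain-SQL events: the base64-encoded BINLOG row events embed the db name internally and pass through sed untouched.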
[04:14:51] we still need to change the db name throughout the file
[04:15:05] I think springle was saying that shouldn't do anything
[04:15:14] they're not commented?
[04:15:25] no, i said the end_log_pos type lines shouldn't do anything
[04:15:45] the 'use ...' lines will definitely change the default db on the fly
[04:15:47] but eh
[04:15:48] pigz -dc exact_replay.sql.gz | sed -e 's/`civicrm`/`dev_civicrm`/' | mysql dev_civicrm
[04:16:02] let's fix the file itself
[04:16:06] * awight looks for use lines
[04:16:14] I don't trust shell commands with escaping and backticks
[04:16:24] heh
[04:16:40] oh god 9223425:use `civicrm`/*!*/;
[04:16:43] right ok
[04:17:52] Jeff_Green: now, the good file is exact_replay2.sql.gz
[04:18:04] should be free of `civicrm` references
[04:18:12] grand
[04:18:51] alright, any last thoughts before we launch the cows over the wall?
[04:19:27] Tartars never get smallpox.
[04:19:51] ERROR 1609 (HY000) at line 190: The BINLOG statement of type `Table_map` was not preceded by a format description BINLOG statement.
[04:19:56] noooo
[04:20:06] what does that mean
[04:20:10] that is the cow bouncing off the parapet
[04:20:18] moof
[04:20:54] http://bugs.mysql.com/bug.php?id=72804
[04:20:59] maybe
[04:21:53] no, probably not
[04:22:22] awight: Jeff_Green let me review the file. where on lutetium?
[04:22:38] /srv/binlogs_from_db1025
[04:22:53] yeah /srv/binlogs_from_db1025/exact_replay2.sql.gz
[04:23:20] the nearest Table_map is line 199
[04:23:21] I snipped after the COMMIT following a plain sql statement
[04:23:37] and there's lots of idle-seeming stuff before the offending binlog
[04:26:34] springle: would it help if I figure out the step IDs and used the mysqlbinlog tools to cut at those points, rather than doing it all textutils style?
[04:28:10] awight: using --...position arguments?
if you can
[04:28:19] ok
[04:40:57] awight: Jeff_Green: about to hammer lutetium
[04:41:05] ok
[04:41:18] eh
[04:44:58] gunzip can handle concatenated gzip files, right?
[04:45:43] good question, i dunno offhand
[04:45:58] Jeff_Green: ok here's the thing: replay3.sql.gz
[04:46:22] it's been snipped by mysqlbinlog, rather than textutils
[04:46:34] ok
[04:47:09] it's not compressed
[04:47:14] wat
[04:47:23] right
[04:47:33] replay3.sql
[04:47:54] ok, shall we enfire it?
[04:48:06] please
[04:48:20] ERROR 1146 (42S02) at line 211: Table 'civicrm.wmf_contribution_extra' doesn't exist
[04:48:24] mooo
[04:48:25] uh oh
[04:48:28] hah ok that's...
[04:48:30] that
[04:48:34] WAT
[04:49:13] oh. the BINLOG base64 cruft contains hard db name references
[04:49:18] charming
[04:49:53] yep.
[04:51:01] soooo
[04:51:11] awight: Jeff_Green: suggest all stop. let me play with lutetium for a while
[04:51:13] happily, I'm mostly certain that nothing succeeded in modifying the civicrm db, in those first lines
[04:51:21] springle: sold
[04:51:57] we should switch users to one that can't write to any of the slave dbs
[04:52:03] springle: fwiw, I did the snipping by --*position params now. results are in the file replay3.sql
[04:52:48] springle: is it helpful if I regenerate that file, so the references are consistently to `civicrm`, again?
[04:56:28] replay3-civicrm.sql.gz is my attempt at snipping, now referring to `civicrm` everywhere
[04:56:46] Let me know if I can do anything
[04:58:08] armchair observation: maybe this would run if dev_civicrm were renamed to civicrm
[05:00:05] hey guys... how's it going in the trenches?
[05:00:13] cvirtue was looking for an update a min ago.
[05:00:25] also, my sympathies.
[05:00:27] :/
[05:00:51] still down
[05:01:16] fair enough
[05:01:35] good luck...
let me know if I can do anything besides moral support
[05:01:38] we've restored the affected table from backup, and now Sean and Awight are looking at backfilling the missing rows from binary logs
[05:01:51] oh well that sounds like major progress :)
[05:02:25] Jeff_Green: I'm ready to move on without those missing rows...
[05:02:27] sorta kinda. the easy part is done anyway
[05:03:00] I can fake out the data for now, and some glorious day in the future we can repair those rows correctly.
[05:03:13] awight maybe while sean is looking at the binlog approach, is there anything we could do using logs?
[05:03:33] not really, that would mean starting the jenkins jobs up again
[05:04:05] well... most of the audit things can be run multiple times, if we have to
[05:04:13] still, it's the major gifts work I'm most loath to lose.
[05:04:25] yeah
[05:04:44] unfortunately, our audit stuff is not yet smart enough to repair existing data--it can only create new rows
[05:05:15] just to increase the heat a bit, Major Gifts has an event this Thursday, so I'm assuming they have been busy...
[05:05:44] but. lemme go ahead and make fake wmf_contribution_extra rows for the missing data
[05:07:20] isn't sean still looking at the binlogs?
[05:08:01] springle: I'm going to focus on getting our production db ready to use again, without those 20,000 missing rows. If you succeed in whatever crazy thing you're doing, I'll use that corrected data to improve my plastic surgery disaster... Sound good?
[05:09:01] Jeff_Green: we can always take the result of what he's doing, and overwrite the fake data.
[05:10:06] you'll be able to identify the fake data?
[05:10:10] Jeff_Green: yah
[05:10:34] Jeff_Green: one last thing, can you rename the _restore table in production, to wmf_contribution_extra?
[05:10:42] yeah
[05:10:50] thx.
[05:10:59] I'm worried that we left springle in the mines...
[05:11:40] table renamed
[05:11:43] k
[05:11:57] should I stop replication while you create rows?
[05:12:08] Jeff_Green: um another favor. can i have a scratch db in case I need it?
[05:12:19] i think that's the wrong approach
[05:12:22] Jeff_Green: I donno about replication. No harm, I think, since we're backed up
[05:12:31] Jeff_Green: wait what is wrong
[05:12:40] restoring to the production db like that
[05:12:49] fair
[05:13:14] I was thinking, sean is busy in dev_civicrm tho
[05:13:17] imo we should use the live slave on lutetium to generate SQL that we can check then run on the master
[05:13:38] you have a scratch db of your own on lutetium
[05:13:47] sure, good plan
[05:23:07] Jeff_Green: is there a frack box without apparmor, where I can easily start a second mysqld instance?
[05:24:01] lutetium is the best bet really
[05:24:12] there's not another db class box
[05:24:33] is apparmor blocking you there?
[05:33:41] worked around it
[05:33:46] k
[05:37:48] Jeff_Green: I've synthesized the rows in awight.wmf_contribution_extra, now gonna dump them and reimport on production.
[05:38:00] ok
[05:38:04] You want to take a look, or prefer blinders :D
[05:38:17] i can look if you want
[05:38:22] watch out for the drop table!
[05:38:31] grampa!
[05:38:37] quit driving by braille!
[05:38:57] actually
[05:39:21] let me know when you're ready to restore, and I'll pause replication until you're confident it went well
[05:39:55] sure
[05:41:37] Jeff_Green: poised to import.
[05:42:39] go. db1008 stopped
[05:42:43] going
[05:43:26] examining...
[05:44:06] Jeff_Green: looks good
[05:44:13] I'm starting Jenkins things again
[05:44:19] ya? ok. starting db1008
[05:45:39] Jeff_Green: jobs look healthy.
[05:45:47] great
[05:46:18] Jeff_Green: springle: ok, production fundraising stuff is back up, i've synthesized the missing rows for now.
[05:46:27] alright
[05:46:41] I would still like to backfill the missing data, but it's low priority, now
[05:54:24] awight you see the ty failmail right?
[05:57:53] Jeff_Green: hadn't seen it, ok thx.
[05:59:39] awight: Jeff_Green: managed to get the binlog loading. testing now [06:00:09] wowza [06:01:11] had to blow away all the foreign key constraints, and force continue when records are missing. you'll have to vet the result manually [06:02:40] nice [06:03:41] anything but nice, really [06:04:04] ha [06:04:46] but it sounds so much more grateful than "THE HORROR!" [06:08:10] it isn't going very fast, because this tiny mysqld instance is like 128M memory [06:09:52] awight: is tomorrow ok for backfilling? i can email you guys an update tonight, and leave the result on lutetium [06:10:54] springle: absolutely, thanks for that amazing end run [06:11:57] well, no chickens hatching just yet [06:12:59] ... I hatched a few batches of duck eggs in my bedroom for a few months... it can be heartbreaking [06:13:18] but when it's all over: ducks! [06:27:14] * awight cries self to sleep [06:40:03] (PS4) Awight: WIP Customized LYBUNT report [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170268 [19:33:31] Hey, team. Looks like you guys had some fun while I was out. [19:33:31] Anybody want to give me the summary? I can't find the real emails amongst the queue consumer shrapnel. :p [19:33:31] Hi K4-713 [19:33:32] Good morning/afternoon/whatever, ejegg! [19:33:32] So, we... went up and down a couple times? And late night maintenance? And... [19:33:32] The wmf_contribution_extra table got wiped out at some point, and awight, springle and Jeff_Green managed to restore it [19:33:32] * K4-713 smirks [19:33:32] Wow. [19:33:32] big fun! [19:33:32] Hey Jeff_Green! And... a very Good Aftermath to you, too. [19:33:32] the emails with the details are 'Queue consumers are paused' [19:33:32] Ah, there we go. [19:33:32] Thank you. [19:33:32] My archive finger was getting tired. [19:33:33] hey! [19:33:33] i'm not sure where we are in terms of backfilling the missing rows [19:33:33] awight seemed to indicate that some of that happened...
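Sean's actual replay used mysqlbinlog against a tiny test mysqld, with the foreign key constraints dropped so the load could continue past missing records. As a toy illustration of why that produces output needing manual vetting, here is a hedged Python simulation (the event model is invented, not MySQL's real binlog format):

```python
# Sketch: replay row events, skipping any whose parent row is missing --
# the analogue of disabling FK checks and forcing the import to continue.
# The {"id", "parent_id"} event shape is a made-up simplification.
def replay_events(events, existing_parents):
    """Apply events in order; record skipped ones for manual vetting."""
    applied, skipped = [], []
    for event in events:
        if event["parent_id"] in existing_parents:
            applied.append(event["id"])
        else:
            skipped.append(event["id"])  # would have violated a FK constraint
    return applied, skipped

applied, skipped = replay_events(
    [{"id": 1, "parent_id": 10}, {"id": 2, "parent_id": 99}],
    existing_parents={10},
)
```

The `skipped` list is the part a human has to reconcile afterwards, which is why the result landed on lutetium for review rather than going straight to production.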
[19:33:33] ...but, it's hard for me to tell at this point. [19:33:33] sean managed to play back the replication binlogs against a test instance, and that table is in a db on lutetium [19:33:33] Baller. [19:33:34] i'm not sure if adam has had a chance to check that table and import the rows back to production [19:33:34] sean mentioned some uncertainty about order in an autoincrement column [19:33:34] Makes sense. [19:33:34] Besides all that, Worldpay flipped a switch that broke everything, in an attempt to make refunds easier. [19:33:34] ha, brilliant [19:33:34] ejegg: I think they call that "solving the problem upstream" or something. [19:33:35] They flipped the switch back the other way after a day, but to get the easy refunds we have to send different order ids for each transaction in a donation attempt [19:33:35] This patch might do that: https://gerrit.wikimedia.org/r/171561 [19:33:35] ejegg: Yeah, so... different order ids in WP are not going to work without a scaryass set of changes. [19:33:35] card is here: https://wikimedia.mingle.thoughtworks.com/projects/online_fundraiser/cards/2144 [19:33:35] The only reason it works now, is that we sneakily base all the order IDs on ctid. [19:33:35] that patch doesn't actually change the order ids as far as most of the adapter is concerned - it just adds a few transaction-specific values to add a suffix to the OrderNumber for all the transactions besides the AuthoriseAndDeposit [19:33:35] So, it's ctid.attempt.someotherinteger? [19:33:35] Actually, I can tell you right now that this is definitely going to break the new audit script, too. [19:33:35] yep [19:33:35] I was hoping it wouldn't since we don't change the order ID for the real charge, just for the 10 cent auth test and the token mumbo jumbo [19:33:35] Well, it's going to use whatever actually makes it into their reconciliation files, and grep payments logs for that exact thing, so... [19:33:35] ...I think the auth test is the one that contains the donor info. 
[19:33:35] I think. [19:33:46] So the thing in their logs will likely be the real charge XML's OrderNumber (still the same order_id), which will be a substring of the auth test XML's OrderNumber. But the log prefix isn't generated from the transaction-specific XML values, it's the raw order_id from DonationData, which that patch doesn't touch. [19:33:46] I think it may work out [19:33:47] ejegg: I don't think it's using the log prefix. [19:33:47] ...for some... insane reason I can't really... remember. [19:33:47] Hmn. [19:33:49] Well, if it's using a regex with word boundaries on either end, the suffixed version should still match [19:33:49] let me take a look at that script [19:33:50] Also, I don't actually know what oid comes through in their recon files. [19:33:50] ...no way to tell up until now. [19:33:50] right [19:33:50] The good news is that all of these things should be easy to fix. [19:33:50] ...compartmentalized. [19:33:50] yah [19:34:00] ejegg: Hey, do you happen to know what caused the TY job failures in the first place? [19:34:00] You know, before the unplanned table prune. [19:34:00] Ahh, no [19:34:00] The errors were all in instantiating the phpmailer object, i think [19:34:00] Wasn't there something... insane about an amazon subscription that sent us a mutant message? [19:34:00] Yeah, that screwed up a couple queues - was that this week too? [19:34:00] Thought that was friday. whew [19:34:00] crazy times [19:34:00] Was it? I... could be making that up. [19:34:00] Adam notified me of it on Monday... I'm on a loaner laptop so can't check the message in Civi [19:34:00] but here it is if you're looking: https://civicrm.wikimedia.org/queue/donations-damaged/amazon-c9a20b96-771d-41c4-86cc-1e90402ce31c [19:34:00] yep, guess that was monday [19:34:00] can I butt in really quickly and ask if either of you know if/how I can connect to lutetium on a loaner laptop? 
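Ejegg's word-boundary point above is worth pinning down: if the audit script greps payments logs using the base order_id wrapped in `\b` boundaries, a suffixed OrderNumber still matches because the suffix begins with a non-word character (the dot), while a genuinely different id does not. A small sketch, with a hypothetical helper name:

```python
import re

def log_line_matches(order_id, log_line):
    """Match the base order id with word boundaries. A suffixed
    OrderNumber like '1234.1.2' still matches base id '1234.1',
    but '1234.10' does not. re.escape keeps the dot literal."""
    pattern = re.compile(r"\b" + re.escape(order_id) + r"\b")
    return bool(pattern.search(log_line))
```

This only holds if the recon files really carry the base id as a substring, which, as noted above, there was no way to confirm up to that point.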
[19:34:01] ooh, you'll need your ssh keys [19:34:03] blegh [19:34:03] ccogdill_: How long are you going to be suffering with the loaner? [19:34:03] And: Are you going to get the old one back? [19:34:03] no idea, but Chip will be looking at my laptop after metrics [19:34:03] so maybe! [19:34:03] ah, okay. [19:34:03] Yosemite killed it :/ [19:34:03] Oops. [19:34:03] Well, shoot. How important is it to get back in before we know if the old machine is going to work again? [19:34:03] I can wait till after metrics, it's just not ideal [19:34:03] if I have my lutetium public key is that enough? or I need the private one too? [19:34:03] hmm I guess that sounds obvious now that I ask [19:34:03] :D [19:34:03] :p [19:34:03] I'm guessing that you didn't back up the private key... [19:34:03] it appears not... [19:34:04] I'll just cross my fingers Chip can work his magic [19:34:04] ...so, if we can't retrieve it, you'll have to get new keys. [19:34:04] yeah [19:34:04] maybe [19:34:05] we'll see [19:34:05] And, I mean, nobody should have extra copies of private keys wandering around, so don't take this as a lesson to copy it all over the place. [19:34:05] The only place I have mine backed up is on a fully encrypted USB stick that nukes itself if you get the password wrong too many times. :) [19:34:05] * K4-713 makes the serious face [19:34:05] haha where can I get one of those?! [19:34:05] then it bites your hand [19:34:05] atgo: it says I'm not allowed to join the video call :( [19:34:06] ccogdill_: Annoyingly, thinkgeek no longer carries them. [19:34:06] AndyRussG i sent you an invite [19:34:06] ? [19:34:06] what? you'd think that would only increase in demand [19:34:06] weird [19:34:06] ccogdill_: It's close to this, though: http://www.ironkey.com/en-US/encrypted-storage-drives/250-personal.html [19:34:06] okay maybe I'll get one of those after this is all figured out [19:34:07] They are only slightly crazy-expensive. [19:34:07] But, you know: Safety first.
[19:34:07] Okay, heading upstairs for metrics. [19:34:27] (PS1) Ejegg: Send different OrderNumbers for each WP transaction [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/171561 [19:34:28] (CR) Ejegg: "Gonna hold off CR till the bower_modules submodule and bower.json updates are ready" [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/171486 (owner: Ssmith) [19:40:19] hey ejegg - just a quadruple check that the cookie changes from earlier this summer are OK [19:40:27] i know you set the date to start sometime around now [19:40:37] or expire or something [20:05:41] atgo: They should be! [20:05:51] cool, thanks! :) [22:34:05] Is there standup todaaaaaay? [22:34:20] for a year and a day [22:35:24] awight: boo! [22:36:11] AndyRussG: hey! [22:36:35] :) [22:36:54] I'm alone... in the infinite universe... of a hangout... [22:38:12] The icon for turning the web cam on or off in hangouts is so passé... [22:38:21] hehe [22:38:38] sorry, we had an emergency long team lunch [22:38:48] I hear comrades slowly filling in around me [22:38:48] ah np [22:38:50] but not quite [22:39:02] a tasty emergency then [22:42:42] so... standup? no standup? [22:42:57] it doesn't look like it [22:43:07] the voices faded away [22:43:28] heheh OK [22:43:31] I'm... not officially working, just poking at the LYBUNT nonsense a bit while my kid's asleep [22:43:43] yesterday's meltdown was ridiculous [22:44:00] I guess I didn't learn much about it even... [22:44:38] I just got access to the FR cluster, BTW... should I be doing something to help? [22:45:01] For the 'do not solicit' ticket, is there any expectation that the field appear with the 'do not email' and 'do not phone' fields, or is a regular custom field in its own group fine? [22:45:06] Hopefully Jeff and Sean forget and forgive [22:46:04] And Phil, Jessica and Laura too? [22:46:35] they are history [22:46:46] ? [22:46:52] I was afraid to remind them by actually pinging full names :) [22:47:07] it was terrible.
I thought I'd damaged a column in the CiviCRM database... [22:47:16] uuuf [22:47:18] went down this long long road to try to repair this damage [22:47:43] by the time I'd accidentally dropped the entire table and had to turn off all our jobs, I realized I hadn't actually been successful in the original damage [22:48:07] ouch [22:48:23] yah 15 hours later... [22:49:03] stealth apparent damage causing greater damage... [22:49:10] did it get fixed? [22:56:20] hey ejegg AndyRussG awight pizzzacat1... sorry for the sloppiness around standups lately. could we shoot for 1:03 daily going forward? [22:58:55] atgo: np, 1:03 is fine for me... [23:01:57] ok, I like that time too! [23:08:44] Yeah that'd help me be more flexible with homework helping time here... speaking of which... [23:41:02] atgo I am definitely ok with 1:03 [23:41:10] ok.. changed it :) [23:41:13] late response I know heh [23:41:23] was troubleshooting database issues with awight [23:55:00] ppena who should be on it?