[13:46:23] happy eclipse day fr-tech!
[14:20:13] woot, happy eclipse day! we've got our pinhole camera ready to go
[14:38:24] Jeff_Green, nice!
[14:50:21] (PS8) Mepps: WIP rectify orphan function [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/369996
[14:51:48] (CR) jerkins-bot: [V: -1] WIP rectify orphan function [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/369996 (owner: Mepps)
[14:59:38] what's a fund?
[15:00:22] money for donation?
[15:02:58] it's cloudy here :(
[15:10:03] thought i'd miss it completely at this latitude, but apparently we can see 15% of the sun covered
[15:20:28] morning ejegg!
[15:21:08] purplex88: yes, we're the people who help the foundation get donations
[15:21:13] hi mepps!
[15:21:28] I'm only sorta-on again today, doing a few meetings
[15:21:42] ah okay ejegg
[15:21:46] Do i get my money back once i pay?
[15:22:06] dstrine hangouts just froze on me, rejoining
[15:22:35] purplex88: if you want a refund, you can email problemsdonating@wikimedia.org
[15:23:35] i am trying to understand what's a fund / loan / donation etc.
[15:24:06] purplex88: donations are money given voluntarily to support a project
[15:24:26] many people who like wikipedia give money to the foundation that pays for programming and servers
[15:25:15] the foundation uses the money to keep wikipedia working and to promote wikipedia in places where it is unknown
[15:59:12] crowd funding = free money?
[16:12:06] fundraising != crowd funding
[16:14:02] but these come from a crowd of people
[17:12:11] (PS9) Mepps: WIP rectify orphan function [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/369996
[20:02:36] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: civi dedupe: offer dedupe option in a regular search - https://phabricator.wikimedia.org/T151270#3539492 (DStrine)
[20:02:38] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Civi dedupe: enable Find & Merge Duplicate Contacts to use search results - https://phabricator.wikimedia.org/T146997#3539494 (DStrine)
[20:03:09] eileen: o/
[20:03:15] back in NZ?
[20:04:22] Fundraising Sprint Prank Seatbelt, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: civi dedupe: offer dedupe option in a regular search - https://phabricator.wikimedia.org/T151270#2812756 (DStrine)
[20:10:57] cwd yep
[20:11:28] wb
[20:12:31] cwd are you joining the call in 3?
[20:13:06] ah i hadn't heard anything
[20:13:09] i certainly can
[20:14:36] Fundraising Sprint Prank Seatbelt, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fr-CiviCRM-dedupe-FY2017/18: dedupe or other feature? cleanup - https://phabricator.wikimedia.org/T173776#3539506 (DStrine)
[20:14:47] fr-tech: any topics you'd like to discuss?
[20:15:02] Fundraising Sprint Prank Seatbelt, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fr-CiviCRM-dedupe-FY2017/18: dedupe or other feature? cleanup names and addresses in civi - https://phabricator.wikimedia.org/T173776#3539518 (DStrine)
[20:15:17] replag
[20:15:40] yeah, jeff and i have been talking about that
[20:16:23] any new ideas on it?
[20:16:45] cwd: do you want to use the ops call to talk about this?
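(For context on the "replag" thread that starts here: replication lag is just how far a replica's SQL thread has fallen behind the master. A minimal way to check it on any of the frdb/frdev replicas is standard MySQL, nothing frack-specific; the sketch below assumes you're connected to the replica in question.)

-- Run on a replica (e.g. frdb1002): Seconds_Behind_Master is the lag;
-- NULL means the SQL thread isn't running at all.
SHOW SLAVE STATUS\G
-- Fields worth watching: Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running.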
[20:16:58] my current theory is: if we run banners and dedupe at the same time, we'll get lag
[20:17:09] dstrine: certainly can, though IRC is fine with me too
[20:17:23] Jeff_Green will have useful insights as well
[20:17:44] hi hi
[20:17:48] last year we turned dedupe off at first, but then we turned it on again & it ran ok - I feel like it survived at the level our 1-hour tests are at
[20:18:14] yep, that is my memory as well
[20:18:36] jeff whipped up this fancy graph: https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=mysql_innodb_buffer_pool_pages_dirty&mreg[]=%5Emysql_innodb_buffer_pool_pages_dirty%24&hreg[]=%28frdb%7Cfrdev%29&aggregate=1&hl=frdev1001.frack.eqiad.wmnet%7CFundraising+eqiad%2Cfrdb1002.frack.eqiad.wmnet%7CFundraising+eqiad%2Cfrdb1001.frack.eqiad.wmnet%7CFundraising+eqiad
[20:19:06] one thing it makes me think is that 1002 being the read server is creating extra lock contention
[20:19:15] Fundraising Sprint Outie Inverter, Fundraising Sprint Prank Seatbelt, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Ingenico: Deal with recurring donations stuck in 'In Progress' status - https://phabricator.wikimedia.org/T171868#3478795 (Eileenmcnaughton) note that civi increments...
[20:19:26] that's pretty
[20:19:37] I'm finding myself overwhelmed with unknowns re. the replag problem, maybe it would be helpful to collaboratively assemble a timeline to see if we can correlate code vs config vs hardware vs banner to what's happening performance-wise
[20:19:52] so which server does frdb1002 potentially lock?
[20:20:13] just itself
[20:20:20] but it might be blocking the replication thread
[20:20:36] eileen do you mean what are the downstream impacts of replag?
[20:20:39] replication is a single thread going on in the background
[20:20:47] compared to the master which can do many things at once
[20:21:03] so intuitively it seems like it could easily overwhelm the replicas in times of stress
[20:21:12] I mean, if frdb1002 is locking - then it locks a copy of the tables it holds?
[20:21:32] well
[20:21:43] i meant, if there are read queries coming from the front end, they might clash with the replication thread
[20:21:51] or slow it down at least
[20:21:59] so if it were just table read locks then yes, the only impact would be that queries hitting there would have stale data
[20:22:14] but I think what's actually happening is more complicated than that
[20:22:32] did you get anything from the dbas?
[20:23:10] I ~think~ it's getting to a state where innodb has too much data in its in-RAM buffer, and decides it needs to flush, and that process can impact client connections like "too busy, go away"
[20:23:48] DBAs have put some comments here https://phabricator.wikimedia.org/T173472
[20:24:09] there are a few main points:
[20:24:43] - evidence of heavy insert/delete activity from application, possibly in large blocks at once
[20:26:42] What does large blocks mean? (we are doing many individual inserts mostly)
[20:27:30] read locks will still block write locks right?
[20:27:48] cwd I'm still confused about that!
[20:28:07] i.e. they will do on the same server, but I don't understand how having the read server plays in
[20:28:12] i'm sure it's oversimplifying
[20:28:30] but like the replication thread is just sitting there replaying every write that happened on the master
[20:28:44] but a read lock could still prevent that thread from getting its write lock
[20:28:54] sorry, I got a phone call IRL, catching up...
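(The ganglia graph Jeff links above plots InnoDB's dirty buffer pool pages. The same numbers can be read directly off the server with stock status counters; a quick sketch, assuming nothing beyond standard MySQL/InnoDB:)

-- Pages modified in memory but not yet flushed to disk, plus the pool size for scale.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_total';
-- A dirty-page count climbing toward the pool size is the "too much in the in-RAM
-- buffer, time to flush" condition described above, and that flush can stall clients.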
[20:28:54] in a timely fashion
[20:28:56] right, but it seems to not commit them until it is caught up
[20:29:23] sorry, what do you mean caught up?
[20:29:36] * Jeff_Green now catching up on backscroll
[20:30:07] well it seems like what we were seeing was the open query on the process_list was getting longer & longer & when all was done it would just disappear
[20:30:42] when things are laggy? my experience has been i see the query constantly changing
[20:30:46] in terms of that graph - do all the lags represent a problem or only the big ones with red all over them?
[20:31:08] cwd I mean when we were looking before you restarted the services
[20:31:16] (the other week)
[20:31:35] so i kept seeing the "closing tables" thing
[20:31:40] but i'm pretty sure that was a red herring
[20:31:44] eileen: that graph is actually not replication lag, it's basically buffered changes in innodb that will need to be flushed to disk
[20:31:49] i saw the same thing when they are caught up
[20:32:28] just a note about the dedupe script - it has a bit of code at the start that counts contributions in the last few minutes & if there are more than a certain threshold it doesn't start
[20:32:32] the blue ones must be the silverpop export
[20:32:37] those huge spikes seem to correlate well to severe replag events, and possibly to that server turning away clients
[20:33:49] eileen: that's good, and i'm not trying to put the blame on dedupe here
[20:33:51] So frdb1002 is the read server for all queries?
[20:33:52] I think it's safe to say we haven't found the smoking gun
[20:33:52] there are a million factors
[20:34:25] eileen: yeah
[20:34:30] eileen: nope, almost all read activity happens on the master b/c civi/drupal is not set up to work with more than one database server
[20:34:33] although i suppose that depends on application support
[20:34:36] ah yeah
[20:34:57] ignore me :P
[20:34:58] ok - so frdb1002 is mostly just doing replication
[20:35:01] we have a task somewhere in the annals of civicrm feature requests re. that
[20:35:06] & that is the one that had the disk problems?
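(A rough illustration of the dedupe pre-flight guard eileen describes above: count recent contributions and bail out if the site looks busy. The civicrm_contribution table and receive_date column are real CiviCRM schema, but the 10-minute window and 1000-row threshold are made-up numbers, not the script's actual settings:)

-- Hypothetical sketch of the dedupe script's "is the site busy?" check.
SELECT COUNT(*) AS recent_contributions
FROM civicrm_contribution
WHERE receive_date > NOW() - INTERVAL 10 MINUTE;
-- If recent_contributions exceeds the configured threshold (say 1000),
-- the dedupe run exits early instead of competing with donation traffic.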
[20:35:23] it's doing replication, and I believe we have some queue consumers that talk to it for specific long queries, outside of civicrm
[20:35:41] i wonder if the donations consumer does
[20:35:44] I feel like we need to try some high load & experiment with which jobs are running
[20:35:48] frdb2001 was the one that had a failed disk, long since repaired
[20:36:17] so far nothing will randomly start on fire outside the context of high traffic to the site
[20:36:24] eileen: yep, and also it would be nice to have a timeline of what has changed since 1/2017, code- and config-wise
[20:36:27] and the long tail of queue consumers
[20:36:28] so why the difference between 1001 & 1002?
[20:37:08] 1001 is the master so it is doing a pretty different type of work
[20:37:09] we don't fully understand that yet, but I think there's some level of adaptiveness to how mysql/innodb decides to checkpoint and buffer
[20:38:48] if the donation queue consumer is handing off any expensive reads to 1002 that could explain the asymmetry
[20:40:14] eileen: another thing that came up in discussion with the DBAs was that deletes put undue strain on innodb, so anywhere we're doing a lot of deletes could be interesting
[20:40:52] ok - only cache tables stand out on that front
[20:41:23] (I assume we replicate those as well, although we could choose not to)
[20:41:40] cwd: when you've seen the "closing tables" thing, do you remember offhand if there were table(s) mentioned in the query?
[20:42:16] when did we turn on the rollback tables thing?
[20:42:20] it varied
[20:42:25] cwd: ok
[20:42:32] it's always changing afaict
[20:42:46] refresh the processlist as fast as possible and i never saw the same thing twice
[20:43:07] k
[20:43:10] but i saw lines for "closing tables" at least half the time
[20:43:25] but it looked about the same when it was caught up
[20:43:32] so i don't know if it means anything
[20:44:16] When a fundraising event happens it could be triggered by banners or by an email. If the former, there will be the donation queue, thank you queue, and dedupe queue. If the latter, there will also be silverpop data retrieval
[20:44:43] so I guess we should get some sort of calendar of what banner / email tests are planned
[20:45:22] yeah
[20:45:55] eileen: do you remember when we added all those triggers and the rollback tables?
[20:46:28] that was early last year?
[20:46:45] that far back? ok
[20:46:54] time flies!
[20:47:13] 2016-05-02 23:49:55
[20:47:27] kthx
[20:48:08] the silverpop job that fetches detailed data & the dedupe jobs will be turned off in the early part of big english (although we might turn on the major gifts dedupe) - but it's not 100% clear that will be enough at the moment
[20:48:57] ok. the writes for that are on frdev1001 only, but I'm not sure where it does its reads
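(The "closing tables" sightings above come from watching the processlist on the replica. A sketch of that kind of check using only stock MySQL; the 30-second cutoff is arbitrary:)

-- Everything the server is doing right now, including the replication SQL thread.
SHOW FULL PROCESSLIST;
-- Or narrow it to the replication thread ('system user') and anything long-running:
SELECT id, user, state, time, LEFT(info, 100) AS query_snippet
FROM information_schema.processlist
WHERE user = 'system user' OR time > 30
ORDER BY time DESC;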
[20:49:40] one thing about the silverpop fetch job is that it uses INSERT IGNORE - I don't know if that causes replicator problems
[20:50:32] if it did, I think it would only explain lag on frdev1001
[20:51:05] yeah & silverpop-fetch was probably the first thing to put heavy replication load on for a while
[20:51:28] so unclear if there was anything special about it or not
[20:51:30] right, those had very long-running queries
[20:51:53] not long running, just many quick inserts
[20:51:55] so that's one type of pain, and those were easy to see b/c you could see the query take forever on the master, then hit the slaves and take forever there
[20:52:18] the fact it LOOKED like a single long query on the slaves was odd
[20:52:20] iirc silverpop has horrible select into queries
[20:52:30] that is select export, not fetch
[20:52:47] the fetch job is literally thousands of tiny insert ignore queries
[20:53:02] https://phabricator.wikimedia.org/T157600
[20:53:16] I wonder if the fact it LOOKED like one big query on the slaves suggests it switched to different replication
[20:54:00] Jeff_Green: that ticket is the export, there are now 2 jobs with silverpop
[20:54:14] ok
[20:54:27] the fetch now runs half-hourly and is up to date
[20:54:34] k
[20:54:48] when did it get up to date?
[20:55:04] it was behind and backfilling for a while right?
[20:55:24] yeah - I'm not 100% sure when it got up to date but I checked just before
[20:55:51] select * FROM civicrm_setting where name = 'omnimail_omnirecipient_load';
[20:55:59] there is a timestamp in the 'reply'
[20:56:25] backlogged vs caught up would be helpful to know re. assembling a timeline
[20:56:45] I know - sometime last week but I don't know more than that
[20:57:09] it was still catching up when I last checked - but that was when I first got to Montreal
[21:00:00] https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&title=mysql_slave_lag&mreg[]=%5Emysql_slave_lag%24&hreg[]=%28frdb%7Cfrdev%29&aggregate=1&hl=frdev1001.frack.eqiad.wmnet%7CFundraising+eqiad%2Cfrdb1002.frack.eqiad.wmnet%7CFundraising+eqiad%2Cfrdb1001.frack.eqiad.wmnet%7CFundraising+eqiad
[21:00:34] unfortunately we don't have ganglia data going back very far b/c the collector was broken
[21:00:46] lemme see if I can pull that data out of logs
[21:03:36] eileen: the silverpop fetch - does it do a transaction commit between each insert ignore?
[21:04:02] I'm pretty sure it does, because there is nothing transactional wrapping it.
[21:05:57] I'm away for 20 mins - I think we should find out from CC & SP what upcoming tests are planned
[21:07:40] yeah that sounds like a good idea
[21:08:24] i'd also like to figure out what else was running last Weds during the banner test, and how much work we think it was doing
[21:10:40] should I email Caitlin & Sam or will you?
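(On the "does it commit between each insert ignore?" exchange above: with nothing transactional wrapping them, each statement autocommits and replicates as its own event, which is why the job shows up as thousands of tiny queries. The sketch below is purely illustrative - silverpop_recipient_staging is a made-up table name, not the real omnimail table - and the batched version is just one possible experiment, not something the team decided on:)

-- Current shape: one autocommitted statement per fetched row.
INSERT IGNORE INTO silverpop_recipient_staging (contact_id, event_type, event_date)
VALUES (123, 'open', '2017-08-16 00:10:34');
-- ...repeated thousands of times.

-- Possible experiment: group the same rows in an explicit transaction so the
-- replica applies one commit instead of thousands.
START TRANSACTION;
INSERT IGNORE INTO silverpop_recipient_staging (contact_id, event_type, event_date)
VALUES (123, 'open', '2017-08-16 00:10:34'),
       (124, 'click', '2017-08-16 00:11:02');
COMMIT;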
[21:11:08] go for it, I'll keep spelunking logs and building a timeline
[21:11:14] another thing: that was english banners
[21:11:21] so a whole other level of traffic
[21:11:28] ya, true
[21:11:40] I've just realised there is a way to see the timeline on silverpop fetch
[21:20:43] it looks like it fetched 1.6 m records over 24 hours - covering 22 July to 10 Aug, over the period between 2017-08-16 00:10:34 & 2017-08-16 23:46:29
[21:21:11] but that seems to be UTC - Mon Aug 21 21:20:57 UTC 2017
[21:21:23] so not quite sure the periods match
[21:21:50] SELECT value, log_date FROM log_civicrm_setting where name = 'omnimail_omnirecipient_load' AND log_date BETWEEN '2017-08-16' AND '2017-08-17';
[21:22:10] shows the last_timestamp incrementing
[21:30:26] Jeff_Green: have you considered trying row based replication?
[21:30:44] yeah
[21:30:53] we're running mixed replication now
[21:31:01] yeah
[21:31:05] i think there's an expected performance hit for row-based replication
[21:31:12] oh really?
[21:31:33] my reading led me to believe that it would result in much larger bin logs, but faster performance overall
[21:32:05] since it would only have to lock one table at a time
[21:32:10] and not for extended number crunching
[21:32:48] it might vary by query type
[21:33:17] honestly I don't know, we made the leap from statement to mixed maybe 2 years ago and stopped there
[21:35:01] yeah
[21:35:20] it sounds like mixed replication will almost always be statement except under certain conditions
[21:35:33] with non-deterministic behavior like DATE() and so on
[21:36:01] but i have not done any auditing of queries to see how often those conditions arise
[22:15:53] ok - I emailed cc & sp
[22:38:00] AndyRussG: I seem to have broken centralnotice a little
[22:38:05] AndyRussG: https://meta.wikimedia.org/wiki/Special:CentralNotice
[22:38:20] AndyRussG: there are two campaign entries
[22:38:44] mmmm?
[22:39:21] Seddon: ok hmm... how did u do that?
[22:39:26] AndyRussG: You archive one and the other archives too
[22:40:22] AndyRussG: I just tried to create a campaign
[22:40:26] That's all
[22:41:15] Yeah, we shouldn't have more than one campaign with the same name... I did in fact file a related bug not long ago, but it only involved cases where you use unusual characters in the name
[22:42:59] Yeah this wasn't intended
[22:45:13] AndyRussG: and Seddon I guess we'll have to be careful with names for now and keep that bug high on the backlog
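(Tying off the row-based vs mixed replication thread from earlier in the evening: the mode is controlled by binlog_format, so checking the current setting and trying ROW on a test host is straightforward. Standard MySQL only - shown as a sketch of what such an experiment would involve, not a recommendation to flip it on the production master:)

-- Current mode: STATEMENT, MIXED, or ROW.
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
-- Switching for an experiment (needs SUPER, affects new sessions only; a permanent
-- change would go in my.cnf with a restart):
SET GLOBAL binlog_format = 'ROW';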