[00:00:05] RECOVERY - check_mysql on frdb1002 is OK: Uptime: 325 Threads: 2 Questions: 187996 Slow queries: 2 Opens: 135 Flush tables: 1 Open tables: 190 Queries per second avg: 578.449 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[00:01:11] * cwd has no idea
[00:09:21] Hey AndyRussG, are you there? Got a question about banner allocation in CN
[00:12:02] Ok AndyRussG, maybe a false alarm :) We have had a Japanese banner pre-test up since 00:00 UTC (11 minutes) and banner allocation didn't list any banners for jaJP users until just now.
[00:20:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3348
[00:25:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3164
[00:30:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2989
[00:35:05] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2096
[00:40:14] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 722095 Threads: 2 Questions: 21724869 Slow queries: 3856 Opens: 12685 Flush tables: 1 Open tables: 598 Queries per second avg: 30.085 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[01:19:03] Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, and 5 others: retrieve lists of contacts who received a particular mailing - https://phabricator.wikimedia.org/T161762#3529776 (CCogdill_WMF) Yep, awesome...
[01:50:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3467
[01:53:35] well that fits with the patterns
[01:53:44] i am going to restart mysql on there too
[01:53:48] because i'm tired
[01:55:15] PROBLEM - check_mysql on frdb2001 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2)
[01:57:04] hey for the first time restarting the daemon did not fix it
[01:57:07] that's nice
[01:58:09] that's nice
[01:58:20] ahem
[02:00:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3522
[02:05:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3452
[02:10:05] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3145
[02:15:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2654
[02:20:05] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1979
[02:25:14] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 1513 Threads: 3 Questions: 725584 Slow queries: 2 Opens: 217 Flush tables: 1 Open tables: 219 Queries per second avg: 479.566 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[02:27:42] spatton: hey! I'm around-ish....
[06:41:28] Fundraising-Backlog, fundraising-tech-ops: Frack replag master thread - https://phabricator.wikimedia.org/T173472#3530012 (Marostegui) Hi, Did you guys have time to set up the graphs we requested so we can better check the problems? Some facts that we saw during the last troubleshooting: - Setting i...
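The check_mysql alerts in this log each carry a Seconds Behind Master value, so the lag trend for a host can be pulled out mechanically. What follows is a minimal Python sketch, assuming the alert text keeps the format seen here. The function name lag_samples and the regular expression are illustrative and not part of any existing tooling.

```python
import re

# Matches alerts such as:
# [02:15:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: ... Seconds Behind Master: 2654
# [^\[]* keeps a match from running past the next "[hh:mm:ss]" timestamp, so
# alerts without a lag figure (e.g. "Cant connect ...") are skipped.
ALERT = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] (?:PROBLEM|RECOVERY) - check_mysql on (\S+) is "
    r"[^\[]*Seconds Behind Master: (\d+)"
)


def lag_samples(log_text, host):
    """Return (timestamp, seconds_behind_master) pairs for one host, in log order."""
    return [
        (timestamp, int(lag))
        for timestamp, alert_host, lag in ALERT.findall(log_text)
        if alert_host == host
    ]


if __name__ == "__main__":
    sample = (
        "[01:55:15] PROBLEM - check_mysql on frdb2001 is CRITICAL: Cant connect to "
        "local MySQL server through socket /var/run/mysqld/mysqld.sock (2) "
        "[02:15:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: "
        "Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2654 "
        "[02:20:05] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: "
        "Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1979"
    )
    for timestamp, lag in lag_samples(sample, "frdb2001"):
        print(timestamp, lag)
```

Plotting these pairs over a night would be one rough substitute for the graphs requested in T173472 until proper ones exist.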
[15:34:08] Fundraising-Backlog, fundraising-tech-ops: Frack replag master thread - https://phabricator.wikimedia.org/T173472#3530966 (cwdent) Last night I ran into the first instance where restarting the daemon did NOT cause the lag to drop, on frdb2001. After about an hour it went down on its own. Not that resta...
[16:00:08] Fundraising-Backlog, Wikimedia-Fundraising-Banners: Safari issue with Other amount option in banner: mask currency code letters - https://phabricator.wikimedia.org/T173431#3531104 (Pcoombe) Open>Resolved Fixed with [[ https://meta.wikimedia.org/w/index.php?title=MediaWiki:FundraisingBanners/CoreJ...
[16:32:13] (PS1) Ejegg: Merge branch 'master' into deployment [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/372412
[16:32:22] (CR) Ejegg: [C: 2] Merge branch 'master' into deployment [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/372412 (owner: Ejegg)
[16:33:15] (Merged) jenkins-bot: Merge branch 'master' into deployment [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/372412 (owner: Ejegg)
[16:45:09] !log updated SmashPig from c501f53aa0394a8f99f4b9fd87b3c8b8294511e0 to 98c55161deffed8364043a27089e7d47cddc3059
[16:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:24] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint Prank Seatbelt, Fundraising-Backlog, Patch-For-Review: Strict error message in logs (quick tidy up) - https://phabricator.wikimedia.org/T171560#3531231 (Ejegg) Open>Resolved
[16:51:40] Fundraising-Backlog, fundraising-tech-ops: Frack replag master thread - https://phabricator.wikimedia.org/T173472#3531233 (jcrespo) I see lots of inserts happening at the time, are you creating (or doing it implicitly) transactions with large insert batches? That would explain if mysql would be delayed w...
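The question about large insert batches points at a common mitigation: commit in small chunks, so that no single transaction holds up everything queued behind it on the replica. The sketch below is only an illustration of that idea under stated assumptions. SQLite's in-memory database stands in for MySQL so the example runs anywhere, and the contribution table, its columns, and insert_in_batches are hypothetical names, not taken from the CiviCRM schema or the fundraising jobs.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, rows, batch_size=500):
    """Insert rows, committing after every batch_size rows. Returns the commit count."""
    iterator = iter(rows)
    commits = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return commits
        # "with conn" commits on success and rolls back if the insert raises,
        # so each batch is its own short transaction.
        with conn:
            conn.executemany(
                "INSERT INTO contribution (donor_id, amount) VALUES (?, ?)",  # hypothetical table
                batch,
            )
        commits += 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contribution (donor_id INTEGER, amount REAL)")
    rows = ((donor_id, 5.0) for donor_id in range(1200))
    print(insert_in_batches(conn, rows), "commits")
```

The trade-off is that a failure partway through leaves earlier batches committed, so a job written this way needs to be safe to resume.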
[17:39:20] cwd: hey, I think we almost had the fr dash working earlier
[17:40:16] we just needed to update the IP address for the oauth back channel
[17:40:25] right?
[17:43:16] err, sorry, you're probably still on the replag case
[17:45:06] Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Privacy: Add state or city level geotargeting to CentralNotice - https://phabricator.wikimedia.org/T152297#3531414 (Dispenser) At the [[https://en.wikipedia.org/wiki/Wikipedia:Meetup/NYC/AfroCrowd/Oct2016-BPL|October AfroCROWD]] I asked how peo...
[17:49:08] ejegg: i can look into that
[17:50:03] cwd cool, only if you're totally not doing anything else
[17:50:50] real low priority, just happened to occur to me now
[17:51:05] cool, sounds like it should be dead simple
[17:51:09] i will add it to the list
[17:51:31] if it's anything more than changing the settings, let's postpone till later
[17:51:48] cool
[17:53:14] ejegg: got the ball rolling here: https://phabricator.wikimedia.org/T173472
[17:53:28] i think this is going to end up being a combination of factors
[17:53:54] oh cool, looks like some good input
[17:54:08] we can probably squeeze some better performance out
[17:54:13] yay!
[17:54:18] but one thing to consider is that replication is single threaded
[17:54:37] hmm, also should go on a large txn hunt
[17:54:37] so it might end up that we are actually just overwhelming it
[17:54:46] that too
[17:55:09] we'll beat it down but it will probably take work from multiple angles
[17:56:32] the last round of replag last night did not respond to restarting the daemon
[17:56:37] it has developed an immunity
[17:59:58] so i think dedupe and the consume_and_thank jobs would be good places to look for large transactions
[18:00:25] i'd be surprised if it was consume_and_thank
[18:01:06] it's only ever dealing with one donor / donation at a time
[18:03:02] yeah, i tend to see it a lot when things are laggy, but i guess it just runs often
[18:03:29] i think anything in combination with dedupe is creating a mess
[18:35:43] ejegg i know i've gotten crm module tests running before but for some reason i keep running into errors right now
[18:35:48] do you have the command you use?
[18:38:26] Fundraising-Backlog, fundraising-tech-ops: Frack replag master thread - https://phabricator.wikimedia.org/T173472#3531679 (cwdent) @jcrespo I have brought the idea up with @Ejegg to scan the most likely jobs for long transactions, sounds like a good thing to do regardless The master is frdb1001 and the...
[18:39:02] hi mepps
[18:39:15] I use vendor/bin/phpunit
[18:39:26] with no arguments
[18:39:32] what errors do you see?
[18:41:58] if you just run 'phpunit' it can use the system phpunit with the vendor/ libs and get confused
[18:45:23] ejegg will that run all the tests for the modules too? i guess you can't isolate them out?
[18:46:34] ejegg: it looks like the oauth url for dash is civicrm.wikimedia.org, i don't see an IP?
[18:46:47] mepps oh, you can still use --group=Offline2Civicrm or the like
[18:46:47] okay that seems to work, i was getting errors trying to isolate them out
[18:47:52] cwd there's no 'providerbackendip' setting?
[18:48:46] cwd there are 3 bits - providerURL is the one the user gets bounced to
[18:48:47] aah that one
[18:49:06] providerBackendURL is the one that node uses to make direct requests to civi
[18:49:20] aah gotcha
[18:49:34] and then providerBackendIP is something we added because the backend url didn't actually resolve to the right IP from inside the cluster
[18:49:57] if backendURL DOES resolve to something good, let's just get rid of the backendIP setting
[18:50:06] and not screw around with DNS
[18:51:10] ejegg: gotcha
[18:51:15] it is set to civi's ip
[18:51:20] but it's not working?
[18:52:46] nope :(
[18:52:59] and what happens if you remove it?
[18:54:15] i guess it depends what the application is doing?
[18:54:30] civicrm.wikimedia.org resolves to the public IP inside the cluster
[18:55:01] but the IP setting is definitely civi
[18:55:34] does anything besides dash use oauth?
[18:55:50] cwd nope
[18:56:25] cwd ok, but with the public IP it's gotta have a client cert, huh?
[18:57:14] I guess we thought the dns screwiness would be easier than the client cert
[18:57:29] yeah totally
[18:57:44] but it should work as is, the IP was recycled for the new box
[18:57:52] which means that something else is broken
[18:58:01] k, lemme see if our modules are all up to date for the right node version
[18:58:15] remind me what nodejs version it is?
[18:58:20] could certainly be oauth since nothing else uses it
[18:58:35] i know nothing about the oauth setup but i can investigate
[18:58:39] didn't we upgrade node?
[18:58:49] the stack trace points to the evildns module
[18:59:13] yeah it's 4.8.2
[18:59:18] cwd thanks!
[18:59:26] :)
[18:59:32] I bet we never changed the deployed module versions
[18:59:33] better than... i think it was 0.8?
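The three settings described above (providerURL, providerBackendURL, providerBackendIP) can be summarised in a small sketch. This is not dash's code: dash runs on Node and, going by the stack trace mentioned above, handles the override through the evil-dns module. The Python below is a hypothetical illustration of the idea only: connect to the override address when one is set, but keep the hostname from providerBackendURL for the request itself. The function name backend_target, the example hostnames, and the 192.0.2.10 address are placeholders.

```python
from urllib.parse import urlsplit


def backend_target(settings):
    """Return (address_to_connect_to, hostname_used_in_the_request).

    providerBackendIP, when present, replaces DNS resolution of the
    providerBackendURL hostname; the hostname itself is unchanged.
    """
    hostname = urlsplit(settings["providerBackendURL"]).hostname
    override = settings.get("providerBackendIP")
    return (override or hostname, hostname)


if __name__ == "__main__":
    settings = {
        "providerURL": "https://civicrm.example.org/",         # where the user's browser is sent
        "providerBackendURL": "https://civicrm.example.org/",  # direct server-to-server requests
        "providerBackendIP": "192.0.2.10",                     # placeholder override address
    }
    print(backend_target(settings))
```

Dropping providerBackendIP from the settings makes the sketch fall back to ordinary resolution of the backend hostname, which is the simplification suggested at 18:49:57.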
[18:59:42] oh jeez, yeah
[18:59:48] something from the stone age
[19:00:11] awesome, it's actually the same version I've got locally installed
[19:00:20] nice
[19:01:01] i have used a tool called n (horrible name) in the past where you can just select whatever version
[19:02:51] relocating, back shortly
[19:03:28] (PS6) Mepps: WIP Orphan Slayer Module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/370225
[19:04:05] (PS1) Ejegg: WIP shared context is its own thing [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372432
[19:05:07] (CR) jerkins-bot: [V: -1] WIP shared context is its own thing [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372432 (owner: Ejegg)
[19:09:43] (CR) jerkins-bot: [V: -1] WIP Orphan Slayer Module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/370225 (owner: Mepps)
[19:30:48] (PS1) Ejegg: Update evil-dns to 0.1.0 [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372436
[19:30:56] (CR) Ejegg: [C: 2] Update evil-dns to 0.1.0 [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372436 (owner: Ejegg)
[19:32:03] (Merged) jenkins-bot: Update evil-dns to 0.1.0 [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372436 (owner: Ejegg)
[19:38:23] (PS7) Mepps: WIP Orphan Slayer Module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/370225
[19:38:40] (PS1) Ejegg: Merge branch 'master' into deployment [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372440
[19:40:55] Fundraising-Backlog, fundraising-tech-ops: Frack replag master thread - https://phabricator.wikimedia.org/T173472#3531968 (Marostegui) >>! In T173472#3530966, @cwdent wrote: > Last night I ran into the first instance where restarting the daemon did NOT cause the lag to drop, on frdb2001. After about an...
[19:44:06] (CR) jerkins-bot: [V: -1] WIP Orphan Slayer Module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/370225 (owner: Mepps)
[19:47:59] (PS1) Ejegg: Update modules for 4.8.2 [wikimedia/fundraising/dash/node_modules] - https://gerrit.wikimedia.org/r/372441
[19:50:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/prometheus/prometheus.yml]
[19:50:47] sorry about that
[19:50:49] fixing
[19:55:12] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[19:55:23] Fundraising Sprint Prank Seatbelt, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Ingenico, FR-WMF-Audit: Ingenico WX audit parser: refunds missing gateway_txn_id - https://phabricator.wikimedia.org/T173457#3532013 (DStrine)
[20:03:41] Fundraising-Backlog: Make Silverpop export leaner - https://phabricator.wikimedia.org/T173538#3532019 (CCogdill_WMF)
[20:04:09] Fundraising-Backlog: Make Silverpop export leaner - https://phabricator.wikimedia.org/T173538#3532031 (CCogdill_WMF) p:Triage>Normal
[21:06:38] (CR) Ejegg: [C: 2] Update modules for 4.8.2 [wikimedia/fundraising/dash/node_modules] - https://gerrit.wikimedia.org/r/372441 (owner: Ejegg)
[21:09:19] (CR) Ejegg: [V: 2 C: 2] Update modules for 4.8.2 [wikimedia/fundraising/dash/node_modules] - https://gerrit.wikimedia.org/r/372441 (owner: Ejegg)
[21:09:56] hrm?
[21:10:08] (Abandoned) Ejegg: Update modules for 4.8.2 [wikimedia/fundraising/dash/node_modules] - https://gerrit.wikimedia.org/r/372441 (owner: Ejegg)
[21:14:50] (PS1) Ejegg: Update evil-dns and syslog packages [wikimedia/fundraising/dash/node_modules] - https://gerrit.wikimedia.org/r/372454
[21:15:04] (CR) Ejegg: [C: 2] Update evil-dns and syslog packages [wikimedia/fundraising/dash/node_modules] - https://gerrit.wikimedia.org/r/372454 (owner: Ejegg)
[21:15:23] (CR) Ejegg: [V: 2 C: 2] Update evil-dns and syslog packages [wikimedia/fundraising/dash/node_modules] - https://gerrit.wikimedia.org/r/372454 (owner: Ejegg)
[21:23:56] (PS1) Ejegg: Update node_modules for evil-dns and syslog [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372457
[21:23:58] (PS1) Ejegg: update dist [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372458
[21:24:16] (CR) Ejegg: [C: 2] Update node_modules for evil-dns and syslog [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372457 (owner: Ejegg)
[21:24:26] (CR) Ejegg: [C: 2] update dist [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372458 (owner: Ejegg)
[21:27:55] (CR) Ejegg: [C: 2] Merge branch 'master' into deployment [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372440 (owner: Ejegg)
[21:31:15] (Merged) jenkins-bot: Merge branch 'master' into deployment [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372440 (owner: Ejegg)
[21:31:17] (Merged) jenkins-bot: Update node_modules for evil-dns and syslog [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372457 (owner: Ejegg)
[21:31:19] (Merged) jenkins-bot: update dist [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372458 (owner: Ejegg)
[21:38:43] !log updated dash from bec0077599737c774cc716851171365b38a8b02a to 696a3fffdd2595331132da9b999c6dc382d0ed69
[21:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:00] phew, got a bit deep into it. Think I might have it working, though!
[23:18:48] (PS1) Ejegg: Update all the node_modules [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372477
[23:18:52] (CR) Ejegg: [C: 2] Update all the node_modules [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372477 (owner: Ejegg)
[23:19:36] (CR) jerkins-bot: [V: -1] Update all the node_modules [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372477 (owner: Ejegg)
[23:19:49] (CR) jerkins-bot: [V: -1] Update all the node_modules [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372477 (owner: Ejegg)
[23:21:02] (PS1) Ejegg: Delete unused modules, update the rest [wikimedia/fundraising/dash/node_modules] - https://gerrit.wikimedia.org/r/372478
[23:32:55] (CR) Ejegg: [V: 2 C: 2] Delete unused modules, update the rest [wikimedia/fundraising/dash/node_modules] - https://gerrit.wikimedia.org/r/372478 (owner: Ejegg)
[23:33:38] (CR) Ejegg: [C: 2] Update all the node_modules [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372477 (owner: Ejegg)
[23:34:30] (CR) jerkins-bot: [V: -1] Update all the node_modules [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372477 (owner: Ejegg)
[23:37:16] (CR) Ejegg: [V: 2 C: 2] Update all the node_modules [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372477 (owner: Ejegg)
[23:38:08] (CR) jerkins-bot: [V: -1] Update all the node_modules [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/372477 (owner: Ejegg)
[23:43:02] (PS1) Ejegg: Merge branch 'master' into deployment [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372480
[23:43:05] (PS1) Ejegg: Update all node modules [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372481
[23:43:06] (CR) Ejegg: [C: 2] Update all node modules [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372481 (owner: Ejegg)
[23:43:55] (CR) jerkins-bot: [V: -1] Merge branch 'master' into deployment [wikimedia/fundraising/dash] (deployment) - https://gerrit.wikimedia.org/r/372480 (owner: Ejegg)