[01:07:40] (PS5) Ejegg: CRM-21521: read multipart-related inside report [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/395566 (https://phabricator.wikimedia.org/T181934)
[01:09:18] ejegg[m]: I'm happy to merge that Email one now - if you are good with that
[01:09:28] (ie. into our repo I mean)
[01:09:36] the fact the test is passing upstream makes me happy
[01:12:36] thanks!
[01:13:12] oh shoot, my matrix clone is back... locked myself out of my irc box for a bit yesterday and tried Riot.im. Now [m] will be around for weeks :S
[01:16:46] ejegg[m]: to answer your own question on github - yes, squashing commits is good
[01:18:06] (CR) Eileen: [C: 2] "I'm happy with this on the basis that there are unit tests upstream that are passing & the bug was replicated in those, and fixed." [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/395566 (https://phabricator.wikimedia.org/T181934) (owner: Ejegg)
[01:18:44] (CR) Eileen: [C: 2] "the pain is over" [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/396278 (https://phabricator.wikimedia.org/T181934) (owner: Ejegg)
[01:22:46] (Merged) jenkins-bot: Undo mailing debug logging [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/396278 (https://phabricator.wikimedia.org/T181934) (owner: Ejegg)
[01:25:21] (Merged) jenkins-bot: CRM-21521: read multipart-related inside report [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/395566 (https://phabricator.wikimedia.org/T181934) (owner: Ejegg)
[03:35:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 5842 MB (10% inode=94%): /dev 10 MB (100% inode=99%): /run 2874 MB (89% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 724627 MB (87% inode=99%): /srv/archive/banner_logs 1353311 MB (29% inode=99%)
[03:40:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 5826 MB (10% inode=94%): /dev 10 MB (100% inode=99%): /run 2890 MB (90% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 724627 MB (87% inode=99%): /srv/archive/banner_logs 1353285 MB (29% inode=99%)
[03:45:09] PROBLEM - check_disk on civi1001 is CRITICAL: DISK CRITICAL - free space: / 5808 MB (10% inode=94%): /dev 10 MB (100% inode=99%): /run 2882 MB (90% inode=99%): /dev/shm 7986 MB (100% inode=99%): /run/lock 5 MB (100% inode=99%): /sys/fs/cgroup 7986 MB (100% inode=99%): /boot 216 MB (86% inode=99%): /srv 724627 MB (87% inode=99%): /srv/archive/banner_logs 1353259 MB (29% inode=99%)
[04:04:33] checking
[04:09:20] cwd: are you looking at civi1001?
[04:09:40] Jeff_Green: yeah...not seeing any full partitions
[04:10:19] the root partition was over 80% which I think is the critical threshold for that box, I removed some older logs as a quick fix, but we really need to clear out /tmp
[04:11:43] mmm yes
[04:12:35] that's a lot of raw recipient data
[04:12:44] yup
[04:14:53] must be an automated process doing this
[04:17:07] yeah
[15:05:19] PROBLEM - check_procs on frdb1001 is CRITICAL: PROCS CRITICAL: 1464 processes
[15:05:54] -_-
[15:08:54] ton of sleeping connections
[15:09:13] mepps: didn't you say dash was running out of mysql handles?
[15:10:19] RECOVERY - check_procs on frdb1001 is OK: PROCS OK: 188 processes
[15:13:32] cwd yeah it keeps saying too many connections
[15:14:19] bet it's related
[15:21:04] i wonder if it could be related to this cwd: https://phabricator.wikimedia.org/T181590 ?
[15:25:32] mm interesting
[15:25:38] all the connections i see are sleeping
[15:25:46] so it seems like an additional problem
[15:31:20] mepps: https://grafana.wikimedia.org/dashboard/db/fundraising-host-overview?refresh=5m&panelId=13&fullscreen&orgId=1&var-server=frdev1001.frack.eqiad.wmnet&var-datasource=frack.codfw%20prometheus&from=now%2Fw&to=now
[15:31:50] got to be a garbage collection failure
[15:33:37] mepps: do you know if dash connects to both the read and write dbs?
[15:34:51] hmm it looks like it's just set up to work with fredge
[15:34:57] but it does read and write
[15:35:16] oh wait
[15:35:22] it does use both the civi db and fredge
[15:37:18] Fundraising-Backlog: dash(?) is leaving mysql connections dangling - https://phabricator.wikimedia.org/T182440#3823237 (cwdent)
[15:38:47] Fundraising-Backlog, Wikimedia-Fundraising-Banners, Regression: Banner "B1617_112919_en6C_dsk_p1_lg_bdr_prp" makes login, signup and search inaccesible - https://phabricator.wikimedia.org/T151962#3823258 (Pcoombe)
[15:43:36] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Look at creating job to clear tmp dir in Civi - https://phabricator.wikimedia.org/T176691#3633945 (cwdent) *bump* We are starting to see disk space alerts on civi1001, appears to be a ton of "raw recipient data" in /tmp which sounds like a good thing t...
[15:45:08] ooh, funky!
[15:45:21] mepps yeah, and the drupal db
[15:45:28] morning ejegg
[15:45:34] morning all
[15:45:42] jgleeson: o/
[15:45:47] did you get logged in to mysql yet?
[15:45:59] not yet, will try now
[15:46:12] hi cwd
[15:47:40] hi jgleeson
[15:47:44] fundraising-tech-ops: Excessive sysloggin' on civi1001 - https://phabricator.wikimedia.org/T182442#3823271 (cwdent)
[15:48:24] cwd ooh, are those dash queries?
[15:48:48] yes cwd I can access it :)
[15:48:54] super
[15:49:10] ejegg: i am guessing the ones on frdev1001 are
[15:49:16] but they aren't even queries, just sleeping connections
[15:49:24] so maybe the adapter doesn't get cleaned up?
[15:49:33] esp. cause node runs as a daemon
[15:49:47] easy to have leaks
[15:50:01] huh, lessee what lib we're using for mysql
[15:50:30] it could default to e.g. send a byte every few seconds to keep the connection alive
[15:50:45] pure speculation of course
[15:54:04] hmm, there's a mysql-promise library
[15:54:05] i'm not sure we close the connection: https://github.com/martinj/node-mysql-promise/issues/66
[15:54:15] ejegg cwd ^^
[15:54:24] ooh, good catch
[15:54:52] ah ha
[15:56:29] hmm, mepps so we only use the promise stuff in persistence.js
[15:56:46] I guess we could make a queryAndClose helper method
[15:56:57] that takes the connection and the query text
[15:58:01] and chains a .then clause to close after the query comes back
[15:58:45] does that sound right?
[15:59:35] sounds promising ejegg (pun only half intentional)
[16:00:16] haha
[16:04:17] ok, so, for 'Database lock encountered' errors, do we want to just make those retry-able
[16:04:33] and not send the failmail?
[16:05:15] or do we think we can get rid of the problematic locks?
[16:05:34] 'cause we should do one or the other if we want to make failmail rare
[16:09:57] do we know why we're encountering them ejegg?
[16:14:51] ejegg cwd dstrine, should i prioritize this close connection issue today or wait for it to go in sprint?
[16:15:35] up to y'all, we get alerts about it so it's not great but it's also not actually crashing anything
[16:15:58] ejegg: database lock encountered == deadlock?
[16:16:26] jgleeson: ejegg AndyRussG mepps I just realized that we have the "coffee time" meeting today. do we want to meet for standup at like 8:45?
[16:16:59] mepps: I'm reading backscroll now...
[16:19:12] mepps: it's kinda up to you all. If it's annoying to you all and will reduce failmail or other alerts then it's probably worthwhile
[16:20:33] in case anyone is not familiar with deadlocks, they are different from locks blocking other queries, which is a common mixup
[16:20:43] deadlock is...
[16:21:18] txn a locks resource 1, txn b locks resource 2, txn a requests resource 2, txn b requests resource 1
[16:21:54] this situation will never resolve no matter how long you wait, so the server kills one thread at random and lets the other complete
[16:22:40] you can avoid this by using exclusive locks which would absolutely murder performance
[16:23:00] or heavy auditing of queries, which in the context of civi is probably nigh impossible
[16:23:30] Fundraising-Backlog, MediaWiki-extensions-CentralNotice, WMDE-Fundraising-CN: Mobile impression drop on German Wikipedia - https://phabricator.wikimedia.org/T182446#3823400 (DStrine)
[16:23:41] (PS1) Ejegg: Better requeue on db locks [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/396422
[16:24:03] cwd sorry, no, there's actually a diff message for the deadlocks
[16:24:16] I think 'Database lock encountered' is 'Lock wait timeout'
[16:24:26] aah ok gotcha
[16:24:40] don't mean to mansplain dbs :P
[16:24:43] just to make sure we're all on the same page
[16:25:09] dstrine, I can do stand-up early
[16:25:40] Fundraising Sprint Winter Wanderland, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, WMDE-Fundraising-CN: Mobile impression drop on German Wikipedia - https://phabricator.wikimedia.org/T182446#3823413 (DStrine) p:Triage>High
[16:28:29] AndyRussG: you around?
[16:31:30] sure, early 👨☝ works for me
[16:32:48] thanks ejegg I moved the meeting
[16:32:57] phooey, my cheap headphones broke. Lessee if the backup pair is any good
[16:34:40] mepps what would you be putting aside to work on the 'close connection'?
[16:35:06] not much ejegg
[16:35:16] it's just not in sprint which is why i asked
[16:35:59] cool, i'd say go for it!
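Editor's note: the queryAndClose helper proposed at 15:56 could look roughly like the sketch below. It is a minimal illustration, not the code in persistence.js: the `connection` object is assumed to expose `query()` and `end()` methods that both return promises, and `makeFakeConnection` is a hypothetical stand-in so the example runs without a database. It uses `finally()` rather than a plain `.then` so the connection is also closed when the query fails.

```javascript
// Hypothetical helper: run one query, then always close the connection.
// Assumes connection.query(sql, params) and connection.end() return promises;
// this is an illustrative shape, not the API of a specific MySQL library.
function queryAndClose(connection, sql, params = []) {
  return Promise.resolve()
    .then(() => connection.query(sql, params))
    // finally() runs whether the query resolved or rejected,
    // so no connection is left sleeping after an error.
    .finally(() => connection.end());
}

// Hypothetical stand-in connection used only for this illustration.
function makeFakeConnection(rows) {
  const state = { closed: false };
  return {
    state,
    // Resolve SELECT statements with the canned rows, reject anything else.
    query: (sql) =>
      sql.startsWith('SELECT')
        ? Promise.resolve(rows)
        : Promise.reject(new Error('query failed')),
    end: () => {
      state.closed = true;
      return Promise.resolve();
    },
  };
}
```

A caller would use it as `queryAndClose(conn, 'SELECT ...').then(rows => ...)`. Opening and closing a connection per query is simple but costs a handshake each time; a connection pool is the usual alternative if that overhead matters.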
[16:37:16] so I forgot, we already WERE trying to make the lock errors retry-able
[16:37:37] I guess I just had it looking in the wrong place for the message
[16:40:18] yeah wasn't that a recent change?
[16:41:09] i thought i remembered a recent patch dealing with the deadlock errors
[16:47:31] mepps: jgleeson AndyRussG would you be free for a slightly earlier chat?
[16:54:11] dstrine: yes I'm here
[16:54:20] sorry, didn't se th eping!
[16:54:28] th'e-ping
[16:54:33] AndyRussG: you got time for standup?
[16:54:42] we moved it a little early so people can get to coffee
[16:54:51] yes of course
[17:02:35] Fundraising Sprint Winter Wanderland, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, WMDE-Fundraising-CN: Mobile impression drop on German Wikipedia - https://phabricator.wikimedia.org/T182446#3823479 (DStrine) @kai.nissen and @tmletzko are you having trouble with specific campaigns or...
[17:13:01] Fundraising Sprint Winter Wanderland, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, WMDE-Fundraising-CN: Mobile impression drop on German Wikipedia - https://phabricator.wikimedia.org/T182446#3823516 (tmletzko) hi @DStrine we do not have access to your data analysis tools. We received...
[17:59:36] Fundraising Sprint Vaporwerewolf, Fundraising Sprint Winter Wanderland, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Patch-For-Review: New Civi Import - https://phabricator.wikimedia.org/T172423#3823613 (jkim_wikimedia) @Ejegg I don't think so - it doesn't seem like we've saved that in...
[18:11:38] Fundraising Sprint Kickstopper, Wikimedia-Fundraising-CiviCRM, FR-Paypal, FR-WMF-Audit: Follow up with Paypal on audit regeneration, enable parser - https://phabricator.wikimedia.org/T167828#3346054 (Unicorntkd) hi i am new here can u help me
[19:24:53] AndyRussG, I have just pushed the latest version of the stats collector library that I've been using to record stats across the queue2civicrm process if you wanted to take a peek. When your python stats library is ready to share, we can put our heads together and discuss how we can combine the two libraries as proposed in yesterday's standup to create our super stats failmail tracking power ranger :)
[19:25:16] ah, the link you say.. well that's here: https://github.com/jackgleeson/stats-collector
[19:28:07] fr-tech I am signing off for tonight. Have a great weekend all and I will see you when I am back next Thursday. Au revoir :)
[19:32:58] Fundraising Sprint Winter Wanderland, Fundraising-Backlog, fundraising-tech-ops: Civi credential for new Engage user - https://phabricator.wikimedia.org/T181537#3823723 (cwdent) I sent the certificate to the address listed and the mail bounced. Is there a reason we don't get all the contractors WMF...
[20:00:53] fundraising-tech-ops: Forward katherine@wikipedia.org and jimmy@wikipedia.org emails to katherine@wikimedia.org and jimmy@wikimedia.org, respectively - https://phabricator.wikimedia.org/T182456#3823808 (bcampbell)
[20:18:47] Fundraising Sprint Winter Wanderland, Fundraising-Backlog, fundraising-tech-ops: Civi credential for new Engage user - https://phabricator.wikimedia.org/T181537#3823904 (LeanneS) Thanks @cwdent. They just provided a different email address to try. Could you send to the new email on the contact page?...
[20:41:16] Fundraising Sprint Kickstopper, Wikimedia-Fundraising-CiviCRM, FR-Paypal, FR-WMF-Audit: Follow up with Paypal on audit regeneration, enable parser - https://phabricator.wikimedia.org/T167828#3824017 (Reedy) declined>Open a:Ejegg
[20:41:24] Fundraising Sprint Kickstopper, Wikimedia-Fundraising-CiviCRM, FR-Paypal, FR-WMF-Audit: Follow up with Paypal on audit regeneration, enable parser - https://phabricator.wikimedia.org/T167828#3346054 (Reedy)
[22:44:17] Fundraising Sprint Winter Wanderland, Fundraising-Backlog, fundraising-tech-ops: Civi credential for new Engage user - https://phabricator.wikimedia.org/T181537#3824364 (cwdent) Sent!
[22:46:16] https://i.imgur.com/FVNCPzl.jpg