[07:10:17] PROBLEM - check_procs on frdev1001 is CRITICAL: PROCS CRITICAL: 1566 processes
[07:15:08] RECOVERY - check_procs on frdev1001 is OK: PROCS OK: 262 processes
[21:10:20] (CR) Eileen: [C: 2] "Makes sense to me. It would be good to identify how to make core pass the info up the exception chain in a better way. Perhaps core should" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/396422 (owner: Ejegg)
[21:12:52] (CR) Eileen: [C: 1] "I don't know so much about dash but feel pretty sure this is just adding extra gateways to an array. I would plus 2 it if I was sure there" [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/394636 (owner: Ejegg)
[21:13:26] (Merged) jenkins-bot: Better requeue on db locks [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/396422 (owner: Ejegg)
[22:55:18] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 4438 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 7 keys, up 82 days 6 hours - memory use is 4.97M (peak 23.12M, 0.09% of max, fragmentation 1.56%), connected_slaves is 2, donations is 18, jobs is 0, jobs-adyen is 0, jobs-paypal is 38, payments-init is 32, pending is 0, recurring is 9, refund is 0, unsubscribe is 19
[23:00:08] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 6488 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 6 hours - memory use is 6.20M (peak 23.12M, 0.11% of max, fragmentation 1.47%), connected_slaves is 2, donations is 9, jobs is 0, jobs-adyen is 0, jobs-paypal is 27, payments-init is 32, pending is 18, recurring is 13, refund is 0, unsubscribe is 19
[23:05:08] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 14254 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 6 hours - memory use is 11.29M (peak 23.12M, 0.17% of max, fragmentation 1.26%), connected_slaves is 2, donations is 12, jobs is 0, jobs-adyen is 0, jobs-paypal is 36, payments-init is 18, pending is 15, recurring is 4, refund is 0, unsubscribe is 2
[23:05:50] Yikes
[23:06:28] be at the computer in 10 mins
[23:10:08] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 22892 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 7 keys, up 82 days 6 hours - memory use is 16.96M (peak 23.12M, 0.24% of max, fragmentation 1.18%), connected_slaves is 2, donations is 12, jobs is 0, jobs-adyen is 0, jobs-paypal is 26, payments-init is 56, pending is 0, recurring is 4, refund is 0, unsubscribe is 4
[23:15:17] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 26924 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 6 hours - memory use is 19.61M (peak 23.12M, 0.28% of max, fragmentation 1.15%), connected_slaves is 2, donations is 17, jobs is 0, jobs-adyen is 0, jobs-paypal is 20, payments-init is 33, pending is 17, recurring is 8, refund is 0, unsubscribe is 6
[23:20:08] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 26958 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 6 hours - memory use is 19.62M (peak 23.12M, 0.28% of max, fragmentation 1.16%), connected_slaves is 2, donations is 5, jobs is 0, jobs-adyen is 0, jobs-paypal is 30, payments-init is 15, pending is 15, recurring is 8, refund is 0, unsubscribe is 7
[23:24:01] well to state the obvious something is putting a lot of messages in the antifraud queue
[23:24:16] Numeric value out of range: 1264 Out of range value for column 'risk_score'
[23:25:08] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 26979 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 6 hours - memory use is 19.65M (peak 23.12M, 0.28% of max, fragmentation 1.16%), connected_slaves is 2, donations is 13, jobs is 0, jobs-adyen is 0, jobs-paypal is 27, payments-init is 47, pending is 19, recurring is 3, refund is 0, unsubscribe is 11
[23:30:08] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 27003 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 6 hours - memory use is 19.68M (peak 23.12M, 0.28% of max, fragmentation 1.15%), connected_slaves is 2, donations is 10, jobs is 0, jobs-adyen is 0, jobs-paypal is 34, payments-init is 48, pending is 1, recurring is 5, refund is 0, unsubscribe is 12
[23:35:17] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 27026 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 6 hours - memory use is 19.70M (peak 23.12M, 0.28% of max, fragmentation 1.15%), connected_slaves is 2, donations is 15, jobs is 0, jobs-adyen is 0, jobs-paypal is 27, payments-init is 26, pending is 27, recurring is 7, refund is 0, unsubscribe is 14
[23:40:08] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 27062 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 6 hours - memory use is 19.73M (peak 23.12M, 0.28% of max, fragmentation 1.15%), connected_slaves is 2, donations is 7, jobs is 0, jobs-adyen is 0, jobs-paypal is 25, payments-init is 56, pending is 15, recurring is 11, refund is 0, unsubscribe is 15
[23:45:10] Fundraising-Backlog: Failing job AntifraudQueueConsumer.php on risk_score format - https://phabricator.wikimedia.org/T183102#3843472 (Eileenmcnaughton)
[23:45:17] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 27090 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 6 hours - memory use is 19.77M (peak 23.12M, 0.28% of max, fragmentation 1.15%), connected_slaves is 2, donations is 12, jobs is 0, jobs-adyen is 0, jobs-paypal is 16, payments-init is 44, pending is 21, recurring is 3, refund is 0, unsubscribe is 17
[23:47:01] eileen: hey
[23:47:23] i don't see any recent deploys or anything
[23:47:31] wondering if this is just front end spam
[23:48:49] cwd is this the risk score one?
[23:49:02] https://phabricator.wikimedia.org/T183102
[23:49:05] gotta be related right?
[23:49:11] antifraud queue bloat
[23:49:12] I was just wondering if we can remove that item from the q
[23:49:35] there is an order that is rejected and is giving an unhandlable format
[23:50:09] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 27122 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 7 hours - memory use is 19.74M (peak 23.12M, 0.28% of max, fragmentation 1.16%), connected_slaves is 2, donations is 9, jobs is 0, jobs-adyen is 0, jobs-paypal is 16, payments-init is 19, pending is 22, recurring is 3, refund is 0, unsubscribe is 18
[23:50:43] Am I interpreting this correctly - the risk score for that transaction is sooooo high we can't cope?
[23:50:48] eileen: it's failing to pop from the queue?
[23:50:55] yeah that is what it looks like...
[23:51:24] eileen: is this consumer not a p-c job?
[23:51:54] well there are 2 errors
[23:52:02] 1 is on the multiqueueconsumer
[23:52:22] I think the other is too
[23:52:40] but if we turn that job off will donations & thank yous continue?
[23:52:46] (I think so)
[23:53:25] eileen: the fail mails i see are for the antifraud consumer
[23:53:34] where do you see multiqueue?
[23:53:47] in the failmail - says /fredge_multiqueue_consumer/fredge_multiqueue_consumer-20171217-234901.log
[23:54:03] "Fredge Multiqueue Consumer failed with code 1" in subject
[23:54:14] but looks to be the same failure cause
[23:54:29] oh, gotcha
[23:54:37] i am looking at the ones that say UNKNOWN ERROR
[23:55:08] PROBLEM - check_redis on frqueue1001 is CRITICAL: CRITICAL: payments-antifraud is 27147 2000 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8 keys, up 82 days 7 hours - memory use is 19.78M (peak 23.12M, 0.28% of max, fragmentation 1.16%), connected_slaves is 2, donations is 11, jobs is 0, jobs-adyen is 0, jobs-paypal is 15, payments-init is 53, pending is 23, recurring is 3, refund is 0, unsubscribe is 20
[23:55:48] oh so maybe fredge multiqueue does antifraud as part of the multi?
[23:57:52] guessing so
[23:58:30] I can replicate bug in a unit test
[23:58:51] by having an out of bounds risk score?