[06:42:03] eileen: I can type here, if there's anything I can add to the meeting?
[06:42:47] awight_mob: ah just your mobile?
[06:42:55] Exactly
[06:42:59] We were wondering if same time tomorrow would work?
[06:43:12] Yes, sounds good!
[06:43:36] great - the one thing we covered was sample data
[06:43:47] I should have the guest wifi password by then ;-)
[06:43:53] :-)
[06:44:06] so what is the deal - you are living where at the moment?
[06:44:28] hi @awight_mob @eileen
[06:44:39] hey saurabh
[06:44:47] Okay perfect, my only input about sample data is that things are easier if we have a "balanced" set, meaning 50% fraud and 50% not.
[06:44:59] Hi saurabh o/
[06:45:26] Yup, but that's probably not going to be the case in a real world scenario
[06:45:44] I might be getting ahead of things though, if you're just discussing a few snippets to analyze the format
[06:46:12] What we can do is only pick non-fraudulent data points equivalent to the number of fraudulent data-points in the training dataset
[06:46:37] saurabh: can you post your dataset link?
[06:46:44] So I found this dataset about credit card fraud - https://www.kaggle.com/mlg-ulb/creditcardfraud
[06:47:14] The problem being "we have 492 frauds out of 284,807 transactions"
[06:48:03] Right, that's the ticket, to synthesize the balanced set... I'm not certain whether it makes a difference for all ML algorithms, but believe it does for some, so for exploring algos this will make it simpler. Also, the model health statistics will be easier to interpret.
[06:48:25] 500/500 might be fine to start with!
[06:49:35] awight_mob: so you are suggesting it's not real world but it's easier to work with during coding in your experience?
[06:50:16] Yah, exactly.
[06:50:24] Yup, makes sense to train on that
[06:51:06] So which model do you think I should start experimenting with?
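The balancing idea discussed above (keep every fraud row, then pick an equal number of non-fraud rows) could be sketched roughly as follows. This is only an illustration, not what anyone in the channel actually ran: the helper name `balance_by_undersampling`, the toy rows, and the `"Class"` label key (1 = fraud, 0 = not) are assumptions made for the example.

```python
import random


def balance_by_undersampling(rows, label_of, seed=0):
    """Keep all fraud rows plus an equal-sized random sample of non-fraud rows.

    `label_of` is a hypothetical accessor returning 1 for fraud, 0 otherwise.
    """
    fraud = [r for r in rows if label_of(r) == 1]
    non_fraud = [r for r in rows if label_of(r) == 0]

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(non_fraud, min(len(fraud), len(non_fraud)))

    balanced = fraud + sampled
    rng.shuffle(balanced)  # avoid all fraud rows sitting at the front
    return balanced


# Toy stand-in for the real transactions: 20 fraud rows out of 1000.
rows = [{"id": i, "Class": 1 if i % 50 == 0 else 0} for i in range(1000)]
balanced = balance_by_undersampling(rows, lambda r: r["Class"])
```

With the figures quoted in the chat (492 frauds out of 284,807 transactions), the same approach would give a set of roughly 492 fraud and 492 non-fraud rows, close to the "500/500" mentioned.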
[06:51:28] Basically, the additional 284k nonfraud samples will just make our ai better at recognizing nonfraud, which isn't what we need it to specialize in.
[06:51:51] Yup
[06:52:00] But another issue being
[06:52:15] If we use all 500 of the fraudulent ones, how can we test our model?
[06:52:30] A 450/50 training/testing split?
[06:52:36] saurabh: I should be clear that I'm a beginner in ML myself :-) but from what I've seen, we should try a handful of models, it's really hard to guess what will work.
[06:53:19] So how about I start with the usual classifying algorithms like Logistic Regression and SVMs etc?
[06:54:43] Scikit-learn includes some cross validation stuff, tldr, it makes several folds, e.g. all ten 90-10% splits, trains and tests each of those models, and if they all behave roughly the same, we can feel justified in training using the full set.
[06:55:06] Sure, SVM, gradient boosting, random forest...
[06:55:40] I found a nice one page guide to shooting from the hip when choosing algorithms, if you ping me later I can forward
[06:55:53] Yeah I'll drop you a mail
[06:56:03] And get started with some light experimentation
[06:56:11] :100%:
[06:56:22] Exciting stuff!
[06:56:43] The kaggle link also has posts by people who've experimented with the data with various models
[06:57:21] I'll go through them once too, start off with something that looks promising
[06:58:52] This was the page I was thinking of, https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/
[07:00:19] Cool, with a 1k set it shouldn't take much CPU time. Feature engineering and dimensionality reduction will probably be the fun part.
[07:00:45] Yup
[07:01:06] Although the dataset columns aren't labeled I think which might cause some problem
[07:01:08] *problems
[07:01:20] Thanks for putting up with my ELIZA appearance today, hopefully I'm full on Max Headroom by tomorrow
[07:03:31] saurabh - you'll send through a new invite?
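The "all ten 90-10% splits" idea mentioned above is k-fold cross-validation. A minimal pure-Python sketch of how the folds are formed is given below, assuming the data has already been shuffled; the helper name `k_fold_indices` is made up for illustration. In practice scikit-learn's `sklearn.model_selection` module (for example `KFold` or `cross_val_score`) would normally do this, along with the training and scoring.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs so each sample is tested exactly once.

    Hypothetical helper; assumes the samples were shuffled beforehand.
    """
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# For a 1000-row balanced set, each fold trains on 900 rows and tests on 100.
folds = list(k_fold_indices(1000, k=10))
```

Each model under comparison would be trained and scored once per fold; if the ten scores are roughly the same, that supports training on the full set afterwards, as described in the chat.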
[07:04:32] Yup sure
[07:06:47] Same time tomorrow works for everybody?
[07:12:58] for me yes
[15:18:20] Fundraising Sprint Junebugs prefer July, Fundraising-Backlog: Cancel & refund the remaining unintended recurring donations from Big EN - https://phabricator.wikimedia.org/T192958#4220909 (mepps) @XenoRyet can you update your query for globalcollect? And can you see how many donors in paypal would be affe...
[15:19:24] Fundraising Sprint Junebugs prefer July, Fundraising-Backlog: Cancel & refund the remaining unintended recurring donations from Big EN - https://phabricator.wikimedia.org/T192958#4220912 (mepps) @MBeat33 When you look at those donors still affected by this banner, were any of their charges refunded?
[15:24:11] Fundraising Sprint Junebugs prefer July, Fundraising-Backlog: Cancel & refund the remaining unintended recurring donations from Big EN - https://phabricator.wikimedia.org/T192958#4220920 (MBeat33) @mepps I'm not sure how to query all donations from that banner to spot-check. All I know is it was in the l...
[15:27:24] Fundraising Sprint Junebugs prefer July, Fundraising-Backlog: Cancel & refund the remaining unintended recurring donations from Big EN - https://phabricator.wikimedia.org/T192958#4220931 (mepps) @MBeat33 yeah @XenoRyet can do the query, I was more curious what you were seeing with the donors reporting.
[15:30:03] Fundraising Sprint Junebugs prefer July, Fundraising-Backlog: Cancel & refund the remaining unintended recurring donations from Big EN - https://phabricator.wikimedia.org/T192958#4220942 (MBeat33) Ah, yes, the donors who are reaching out are ones where the donations need to be refunded as well as cancele...
[15:36:47] Fundraising Sprint Junebugs prefer July, Fundraising-Backlog: Cancel & refund the remaining unintended recurring donations from Big EN - https://phabricator.wikimedia.org/T192958#4220957 (mepps) @MBeat33 Got it, I just wanted to make sure one or the other process didn't fail. It sounds like they just wer...
[19:32:08] AndyRussG: meeting?
[19:32:30] :)
[19:36:49] Fundraising-Backlog: Prospect Tab- Reviewed Field - https://phabricator.wikimedia.org/T194784#4221230 (DStrine)
[19:37:42] Fundraising-Backlog: Contact Report Filters not displaying tag set tags - https://phabricator.wikimedia.org/T194783#4208403 (DStrine)
[19:39:45] Fundraising-Backlog: Testing infrastructure for EventLogging ingress of banner impression and landing page data - https://phabricator.wikimedia.org/T195259#4221243 (AndyRussG)
[20:02:33] Fundraising-Backlog, MediaWiki-extensions-CentralNotice: Improve remind me later JS for advancement banners - https://phabricator.wikimedia.org/T195260#4221269 (DStrine)
[20:05:16] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Assess and implement GDPR - https://phabricator.wikimedia.org/T195261#4221282 (DStrine)
[21:45:14] Fundraising-Backlog: Creating a new record without email - https://phabricator.wikimedia.org/T195266#4221475 (NNichols)
[21:56:44] cwd: hey... I'm here anytime you want to look at this...
[22:05:35] AndyRussG: i'm around
[22:06:04] was just working w/ otto some more, still can't quite get this to work
[22:06:33] but entirely separate issue
[22:11:02] cwd: okok yeah no rush :) mebbe can u post here the exact kafkacat command you'd use currently to monitor the topic? I can start re-checking the JS
[22:14:01] AndyRussG: kafkacat -C -b kafka-jumbo1002.eqiad.wmnet:9092 -t eventlogging_CentralNoticeImpression
[22:17:11] cwd: okok.... From which box? Also, why kafka-jumbo? This example uses kafka1012.eqiad.wmnet from stat1002: https://wikitech.wikimedia.org/wiki/Kafka#Consume
[22:17:16] Maybe there's some doc I'm missing
[22:17:18] thx!!!
[22:17:36] AndyRussG: i believe they replaced all the kafka servers
[22:17:41] now it is jumbo1-6
[22:17:53] do it from alnitak
[22:17:57] it is a codfw server
[22:18:20] alnitak.codfw.wmnet?
[22:18:30] alnitak.frack.codfw.wmnet
[22:19:30] Ah okok
[22:26:51] cwd: to get there, I should be going through frbast.wikimedia.org, and using my frack credentials, no?
[22:27:51] Hmm looks like I still have some updating to do in my ssh config as per https://wikitech.wikimedia.org/wiki/Fundraising/tech/ssh_config
[22:30:29] AndyRussG: that works
[22:30:36] you can also use rigel which is the codfw bastion
[22:30:38] doesn't really matter
[22:33:37] hmmm
[22:34:18] AndyRussG: having trouble?
[22:34:28] yeah for some reason it was trying to go through bast2001 and trying to use the ssh key for normal prod
[22:34:35] no just gonna try that config, one sec
[22:34:49] hopefully it'll work :)
[22:35:05] fail2ban is pretty sensitive but i can unblock you
[22:35:10] if it hangs
[22:43:03] cwd: all good, got in with that ssh config :) thx!!!
[22:45:34] Can I check the fingerprints anywhere?
[22:47:50] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints
[22:58:31] AndyRussG: ah, good call, i will add them to a wiki
[22:58:37] i don't think there is one atm
[22:58:47] ah ok thx much :)
[23:01:14] btw again I didn't get the event from the campaign on ruwiki, even though I did see it sent from the browser
[23:02:18] however I do see it when sent from the campaign on wikibooks
[23:02:29] At least that, I doubt, is a JS issue
[23:06:42] hrrmmm also not getting the event on enwiki
[23:06:48] AndyRussG: that was my experience also
[23:08:02] but i also can't get the wikibooks one to work in ff
[23:08:06] and i do not see the request
[23:10:32] i have a few privacy extensions but they are all green lights
[23:19:38] cwd: when you don't see the wikibooks one in ff, do you see it in kafkacat?
[23:20:05] can you maybe run a different firefox instance without the plugins (like maybe as a different user on your machine)?
[23:21:18] (I often run X programs using more than one local user, just give them permission first: sudo xhost +SI:localuser:some_other_user
[23:21:20] )
[23:21:51] (after that you can log in as some_other_user and run ff or anything else under a fresh profile)
[23:22:34] I wonder what happens to events that are somehow not validated by the EL schema... Do they not come down the Kafka pipe, I guess?
[23:23:08] good question
[23:23:42] Hmmm they go to their own topic: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Verify_received_events
[23:24:52] cwd: yep that's it
[23:25:30] ok gotta relocate in a sec, but that must indeed be a JS issue
[23:25:39] so I'll dig at that...
[23:26:00] or well, I should say, likely is a JS issue
[23:28:10] K back in a bit!
[23:30:35] ok, i'll be around