[00:56:43] leila, it's not an intuitive system, don't feel bad! Notes:
[00:56:43] (a) We should not manually add the markers - the software adds those (after a translation-admin clicks "Mark this page for translation"), and it can easily get confused if humans do it!
[00:56:46] (b) by default it will mark separate paragraphs for translation. I.e. if anything has a blank line above & below, then it will get a separate translation-marker.
[00:56:49] (c) it won't show the bar until a translation-admin has marked the page as ready.
[00:56:50] So, if I look at your edits, I think you were manually adding those 4 markers, with the intent of only having the lines directly underneath be translated (and not the bits in-between). What you'd need to do is individually wrap the bits you want translated with multiple <translate>...</translate> tags.
[00:56:53] * leila reads
[00:56:54] Something like this https://meta.wikimedia.org/w/index.php?title=Research:Characterizing_Wikipedia_Reader_Behaviour/Taxonomy_of_Wikipedia_use_cases&diff=18354793&oldid=18354766&diffmode=source
[00:58:09] quiddity: oh! So I had gotten it almost completely wrong. :D
[00:58:14] quiddity: THANK you. Super helpful.
[00:58:16] but a good guess!
[00:58:24] * leila goes to fix the mess she has created.
[01:01:24] quiddity: sooo, does doing this mean that volunteers will be triggered to do translations? cuz if yes, that's not the intention, yet.
[01:02:09] quiddity: I'm trying to represent a text in 14 languages in a human-readable format, and showing it via the translation tabs seemed to be a good way to do it (and I definitely don't mind if folks start translating it to other languages).
[01:02:16] leila, nope, not until a translation-admin (e.g. me) clicks the "mark as ready for translation" link (which you can't see)
[01:02:49] quiddity: got it. And is the norm for you to automatically approve, or will people check with us?
[01:02:56] people -> translation admin(s)
[01:03:29] as long as the {draft} template is there, no one will touch it.
[01:15:05] * leila goes to a couple of hours of hibernation.
[15:30:02] Amir1: yt?
[15:30:27] nuria: sup?
[15:30:55] Amir1: would you happen to know if there are any encodings commonly used for Farsi (non-utf-8 ones)?
[15:51:12] nuria: there's nothing besides utf-8, or anything else is vanishingly rare
[15:51:50] nuria: but URLs are percent-encoded, like %D8%A2%D9%84%D9%86%20%D8%AA%D9%88%D8%B1%DB%8C%D9%86%DA%AF
[15:51:56] That's "آلن تورینگ" ("Alan Turing")
[15:52:05] https://meyerweb.com/eric/tools/dencoder/
[16:22:30] dsaez: which FB paper did you refer to in the meeting today?
[16:52:14] hey bmansurov, leila
[16:52:22] ottomata: o/
[16:52:23] https://phabricator.wikimedia.org/T191086#4560145
[16:52:30] i'm blacklisting this schema from mysql import!
[16:52:52] ottomata: ok, thanks
[16:53:10] was it expected to be this many events?
[16:53:17] ottomata: yes
[16:53:27] I just replied to the comment.
[16:53:47] ah, nuria knew...
[16:53:48] haha
[16:53:48] ok
[16:54:03] ok, in the future give us an estimate on events/second if you can
[16:54:09] 100% doesn't mean anything to me or luca at just a glance
[16:54:16] but this is more events than any other schema is sending
[16:54:19] system should handle it
[16:54:21] but mysql won't
[16:54:44] ottomata: ok, I'll do a better job of estimating and informing you guys. sorry about the mess-up this time.
[16:54:52] no prob, i think no harm done
[16:54:57] good to know
[16:55:04] good thing it happened while we were watching! :)
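For reference, the percent-encoding Amir1 describes round-trips with JavaScript's standard decodeURIComponent/encodeURIComponent; nothing MediaWiki-specific is involved. A minimal sketch using the exact string from the log:

```javascript
// Percent-encoded UTF-8, as seen in wiki URLs (example from the log above).
const encoded = '%D8%A2%D9%84%D9%86%20%D8%AA%D9%88%D8%B1%DB%8C%D9%86%DA%AF';

// decodeURIComponent interprets the %XX bytes as UTF-8 and recovers
// the original Farsi string.
console.log(decodeURIComponent(encoded)); // "آلن تورینگ" ("Alan Turing")

// The reverse direction produces the %XX form again (spaces become %20).
console.log(encodeURIComponent('آلن تورینگ') === encoded.replace('%20', '%20')); // true
```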
[16:55:12] phew
[16:55:36] hello :)
[16:55:42] elukey: o/
[16:57:11] * leila reads
[16:57:17] so what's the plan? :)
[16:57:29] i think things will be ok, i blacklisted the schema from mysql import
[16:57:32] so far system is fine
[16:58:22] so from https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=eventlog1002&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now-1m it seems fine indeed, but the increase in resource usage was high
[16:58:24] bmansurov: when are we dialing the schema down again?
[16:58:29] * leila sees the problem in good hands.
[16:59:13] if it is not temporary I'd suggest, if possible, dialing it down a bit, just to let it run for a couple of days and see resource usage
[16:59:44] nuria: not sure, leila?
[16:59:46] bmansurov: rule of thumb (i'm just making this up): if a schema is going to do more than 100-200 events/sec, we should probably blacklist it from mysql
[17:00:01] ottomata: ok, got it
[17:00:07] miriam: ^
[17:00:32] leila: thanks, I am reading!
[17:00:47] miriam: also https://phabricator.wikimedia.org/T191086#4560166
[17:00:59] miriam: I wonder if it's worth stopping data collection for that reason ^?
[17:01:02] miriam: we discussed this briefly, whether you all can sample or not. I /think/ the question of whether the sampling will work or not will depend on whether you can sample by unique device (browser) or not. If you can, sampling can work, but please verify.
[17:01:27] miriam, leila: you can
[17:01:57] nuria: thanks! can you sample by session id?
[17:02:02] miriam, leila: this is how it's normally done for A/B tests; the unit of diversion is the session, which identifies a device, so it is actually quite easy to do that
[17:02:09] cc bmansurov
[17:02:18] bmansurov: miriam: If the load will be out of control, I suggest stopping the data collection now, going over the sampling code carefully and making sure we can analyze it, and then starting it again in a few hours.
[17:02:39] nuria: perfect, thanks
[17:02:54] leila: the session id is the one we are using for analysis as well
[17:02:56] nuria: we need consistent sessions across the period of the month, so miriam will need to look carefully into it to make sure that's possible.
[17:02:59] EL load is ok now, and Andrew took care of the most pressing part, which was the mysql insertion rate (we blacklisted CitationUsage)
[17:03:04] leila: the deployment window just closed, not sure if I can find anyone to stop the data collection.
[17:03:06] but it is a bit over the top :)
[17:03:06] miriam: can you sample by session id?
[17:03:28] bmansurov: just put the change in for the next deployment window?
[17:03:34] miriam: and can you confirm that the session id is consistent for a device over the span of the month (except for the edge cases)?
[17:03:40] leila: yes
[17:03:47] leila: yes
[17:04:09] nuria: ok
[17:04:44] I am chatting with Timo now, we may have a slot now
[17:04:46] elukey, nuria: which percentage of sampling would you suggest?
[17:05:09] miriam: i would do 50% at most
[17:05:16] miriam: ciao! Just to have some numbers, this is what is happening now https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=5&fullscreen&orgId=1&from=now-1h&to=now-5m
[17:05:34] miriam: session id is defined per wiki
[17:06:06] bmansurov: can you join #wikimedia-operations please?
[17:06:20] miriam: I can't overemphasize this. Do look carefully over the sampling code. The method we used to sample for wwrw was not ideal, and that can make your analysis much harder later. ;)
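To make nuria's "unit of diversion is the session" point concrete, here is a minimal sketch of sticky, session-based sampling: the in/out decision is derived deterministically from the session token, so every event from one browser session gets the same decision. This is illustrative only (the names `bucketize` and `shouldSample` are invented here); the actual EventLogging function is linked below.

```javascript
// Sticky sampling: derive the decision from the session token itself.
// (Illustrative sketch; not the EventLogging source.)

// Map the token into one of `denominator` buckets, deterministically.
// Using only the first 8 hex chars keeps the value an exact 32-bit integer,
// safely below JavaScript's Number.MAX_SAFE_INTEGER (2^53 - 1).
function bucketize(sessionToken, denominator) {
  return parseInt(sessionToken.slice(0, 8), 16) % denominator;
}

// Sample 1-in-N sessions: a session is "in" iff it lands in bucket 0.
function shouldSample(sessionToken, populationSize) {
  return bucketize(sessionToken, populationSize) === 0;
}

// Example: 50% sampling (populationSize = 2). The same token always yields
// the same answer, unlike an independent Math.random() call per event.
const token = 'a3f09b2c11d4e5f60718'; // a 20-hex-char (80-bit) session token
console.log(shouldSample(token, 2)); // stable for every event this session
```

The contrast with per-event (probabilistic) sampling is exactly what leila and nuria dig into below: if no token is passed, each event is an independent coin flip and a session's events get split between "in" and "out".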
[17:06:46] leila: ok thanks
[17:06:53] leila: agreed. from what i understand, you want the sampling to be sticky for a session - for how long?
[17:07:05] elukey: ciao! Thanks!
[17:07:09] cc miriam
[17:07:15] 50% is still 1750 events/second
[17:07:19] which is more than any other usage of eventlogging
[17:07:26] the max right now is around 800-1000/second
[17:07:34] and that was the most we ever had when we deployed it (VirtualPageView)
[17:08:04] nuria: where can I find the sampling code?
[17:08:07] as FYI we are reverting the change
[17:08:07] i think the system is fine, but it does beg the question of whether you really need all of those events! (which it sounds like y'all are discussing)
[17:08:08] ottomata: you are right, cause peak now is 2500
[17:08:26] that is 3x VirtualPageView
[17:08:30] oh sorry, that is 50% of all events
[17:08:33] ottomata: thanks, this means we should go down to 25-30%
[17:08:46] so 50% would be around 1k, about the same as VirtualPageView
[17:09:04] miriam: there are two modules, EventLogging and WikimediaEvents
[17:09:11] cc bmansurov
[17:09:30] miriam: I'll give you the link in a sec
[17:09:40] miriam: let me send links
[17:09:49] bmansurov, nuria: thanks
[17:10:15] miriam: https://doc.wikimedia.org/EventLogging/master/js/source/subscriber.html#mw-eventLog-method-randomTokenMatch
[17:11:22] miriam: that function is being called from https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/master/modules/all/ext.wikimediaEvents.citationUsage.js#L283
[17:11:48] FYI, krinkle just deployed the change to stop data collection.
[17:11:51] miriam: one issue we ran into in the past was that there was a known bug in one of the functions, I believe it was generateRandomSessionId. basically, what comes out of that is not fully random.
[17:12:08] bmansurov: it is actually an 80-bit token, need to correct that (note to self)
[17:12:17] miriam: so while the code may make sense to you, I suggest being very critical of it and asking the hard question of what the function is actually doing.
[17:13:07] leila: can you expand on that?
[17:13:20] nuria: i see
[17:13:23] * leila is looking for the phab task.
[17:13:45] miriam: basically that function was not choosing the sessionIds uniformly at random.
[17:13:56] leila: mmmm
[17:13:58] miriam: you'll see the function when you look at the code.
[17:14:05] I am looking
[17:14:18] leila: the sessionids are passed along to the function; it calculates a mod over the passed session
[17:15:10] leila: if you do not pass a session, you will just get probabilistic sampling, is that what you are saying? (probabilistic as in 1/100)
[17:15:39] nuria: trying to pull up past conversations on that function. bear with me for a few min.
[17:15:53] leila: thus your sampling is independent of session (if it is not passed along)
[17:19:48] nuria: I /think/ the specific issue I had in mind was resolved after T112688 was addressed.
[17:19:49] T112688: Bug: client IP is being hashed differently by the different parallel processors {stag} [13 pts] - https://phabricator.wikimedia.org/T112688
[17:19:52] * leila still reads
[17:20:14] leila: let me read. the randomizers used are crypto-grade, so really, they are random enough
[17:20:43] nuria: and correct re sampling being independent of the session if the session is not passed.
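The class of bug behind T112688 is worth spelling out. If parallel processors hash the same client IP inconsistently (for instance, each drawing its own salt at startup, which is one hypothetical mechanism, not necessarily what the task describes), one device shows up as two. A sketch of that failure mode in Node.js, with all names invented here:

```javascript
// Hypothetical sketch of a "same IP, two hashes" bug (names invented).
const crypto = require('crypto');

function hashIp(ip, salt) {
  return crypto.createHmac('sha256', salt).update(ip).digest('hex');
}

// BROKEN: each parallel processor draws its own salt at startup...
const saltA = crypto.randomBytes(16); // processor 1
const saltB = crypto.randomBytes(16); // processor 2

// ...so the same client IP hashes to two different identifiers,
// splitting one device's events across two apparent "users".
console.log(hashIp('198.51.100.7', saltA) === hashIp('198.51.100.7', saltB)); // false

// FIX: all processors share one salt (rotated together), so identity
// stays consistent across the whole pipeline.
const sharedSalt = Buffer.from('rotated-shared-secret'); // illustrative value
console.log(hashIp('198.51.100.7', sharedSalt) === hashIp('198.51.100.7', sharedSalt)); // true
```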
[17:21:40] leila: ah i see, that is an entirely different problem (different codebase)
[17:23:38] leila: i see, that project was using hashing of IPs as a random identifier. that is not random for several reasons (IPs are not equally distributed across our user base), but in this case, on top of that, there was a MAJOR BUG that was making the same IP hash to 2 different values. ayayay
[17:24:18] leila: this code that we are looking at has nothing to do with IP hashing, however; it uses the crypto api: https://developer.mozilla.org/en-US/docs/Web/API/Web_Crypto_API
[17:24:27] nuria: yup.
[17:24:28] leila: to assign a session to a device
[17:24:43] leila: given that and a sampling rate, you sample
[17:24:47] leila: makes sense?
[17:25:43] * leila checks crypto API
[17:27:15] bmansurov: makes sense? we will be using the mw.sessionId to pass to the function you just sent, to sample (with stickiness) from devices. let me know if you think I am missing something
[17:27:17] nuria: can we jump in batcave? (I have a meeting in 3 min, want to quickly verify something with you)
[17:27:24] yes, same here
[17:29:56] nuria: let's have a chat after your meeting.
[17:30:00] ok
[17:30:43] bmansurov: it is a session cookie that will disappear once you restart your browser, makes sense?
[17:32:12] nuria: yes, I think so
[17:32:42] nuria: it does (and my meeting got cancelled)
[17:34:04] miriam: so I /think/ you should answer this question on your end: can you live with sampling based on a sessionId that resets when the browser closes? If you want to look at citation usage across a period of one month, this may not be very reliable "alone". You may still want to use the hashed IP address and user_agent and look at webrequest logs data.
[17:34:28] miriam: I don't know the exact questions y'all want to answer at the moment, so I can't give more specific comments. happy to help if you want to talk more.
[17:35:49] leila, nuria: what we understood from the sessionID is that it resets when the cache is cleared?
[17:36:01] bmansurov ^
[17:36:13] miriam: yes
[17:36:20] leila: that is why we were using it for sampling users
[17:36:52] leila: because clearing the cache is fairly uncommon
[17:36:58] nuria: In https://doc.wikimedia.org/EventLogging/master/js/source/subscriber.html#mw-eventLog-method-randomTokenMatch we're slicing the first 8 characters of the 20-character token. Wouldn't that introduce a bias?
[17:37:41] miriam: I agree with you that if resetting means clearing the cache, you're good if you sample by session ID.
[17:39:55] leila: ok, thanks a lot!
[17:41:30] miriam: when you get a chance, let me know what you think about https://phabricator.wikimedia.org/T191086#4560343, thanks
[17:42:23] specifically, can we drop some fields in order to reduce the data size
[17:47:45] bmansurov: thanks for sharing this! Yes, we can drop page_title and infer it from the page_id
[17:48:20] srrodlund: I'm just about to pick up the work for the documentation. Is the goal to finish the taxonomy and start the prevalence?
[17:49:22] srrodlund: I also had some conversations with the team about it. in the case of prevalence, or the card after that, we may want to consider not creating all the subpages, but instead linking to the data in a github repository, giving an example of how that data can be read, and letting people get the data from there.
[17:50:14] srrodlund: the challenge with putting all the data on wiki is that without insights, the data itself is not so useful in wiki format. Whoever wants to dig deep into it has to go back to the file format anyway, so we would be doing a lot of work with no clear application (maybe)
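For context on the Web Crypto reference above: an 80-bit session token of the kind discussed here (80 bits = 10 random bytes = 20 hex characters) can be produced with crypto.getRandomValues. A minimal browser-side sketch, assuming nothing about the actual generateRandomSessionId implementation mentioned earlier:

```javascript
// Generate an 80-bit (20-hex-char) session token with the Web Crypto API.
// (Illustrative sketch; not the MediaWiki source.)
function generateSessionToken() {
  const bytes = new Uint8Array(10);  // 10 bytes = 80 bits
  crypto.getRandomValues(bytes);     // cryptographically strong randomness
  return Array.from(bytes, b => b.toString(16).padStart(2, '0')).join('');
}

const token = generateSessionToken();
console.log(token);        // e.g. "a3f09b2c11d4e5f60718"
console.log(token.length); // 20
```

On bmansurov's bias question: if each hex character is uniform, the first 8 characters form a uniform 32-bit integer, so the slicing itself adds no bias; the only residual bias comes from taking that value mod a denominator that does not divide 2^32, which is bounded by d/2^32 for denominator d, i.e. negligible at these sampling rates.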
[17:50:17] miriam: can we also hash some of the fields, such as 'referrer'?
[17:50:47] Or do we need the actual values for those fields?
[17:51:01] srrodlund: as a result, my suggestion is that we start with the cards and the immediate page that they point to, finalize that content today-Friday, and then start linking to github next week (or this week if we end up with extra time)
[17:51:19] leila: https://arxiv.org/abs/1702.03859
[17:54:04] miriam: hashing may be problematic in the browser. Can we instead truncate long texts? Does it matter? Or can we ignore those 1% of dropped events?
[17:56:19] dsaez: got you. I was looking at https://arxiv.org/pdf/1804.07755.pdf and was wondering if you meant that.
[17:56:57] bmansurov: I think we can drop them. It's difficult to hash or truncate referrers or links to domains; we might need the full url.
[17:57:00] that's different
[17:57:40] I explained the technique that I'm using during the research showcase
[17:58:20] basically, you get some anchor or pivot points that you know are the same in a pair of languages, and do a linear transformation to align them
[17:58:38] dsaez: yup yup.
[17:58:52] back, let me read the backscroll
[17:58:59] * leila tries to keep track of all translation papers.
[17:59:50] leila, miriam, bmansurov: hashed_ip and UA will be reliable for desktop, but really not for mobile
[18:00:31] leila, miriam, bmansurov: you can do a histogram of IPs on mobile and see why; the NAT-ing in mobile connections makes many users access under the same IP (when not on wifi)
[18:01:23] nuria: the IP issue on mobile exists, and is even stronger in the U.S. I did some data digging, and the issue is not as bad in Europe, for example, at least in parts of Europe.
[18:01:50] nuria: user_agents also make the requests very specific in many cases.
[18:01:56] leila: right, because of mobile carriers
[18:02:21] leila: it will be "bad" in any geographical location where there are few dominant carriers
[18:03:00] nuria: from Why We Read Wikipedia, I can tell that we're unable to match requests considering IP and UA only around 5% of the time, sometimes 10%, never more. and while there are some differences between the ability to match desktop vs. mobile requests, our experience is that it's not as bad as expected on paper.
[18:03:26] leila: on mobile data in the US, or overall?
[18:03:58] nuria: that research is language-specific, but let's say in en that's our observation, and the majority of the traffic to en is from the US
[18:04:11] leila: but also desktop, right?
[18:04:27] nuria: correct. we have both platforms in the data.
[18:04:27] leila: or does that figure include mobile requests?
[18:05:15] leila: so the 10% not matched also includes mobile requests in the us?
[18:05:20] US, that is.. ahem
[18:05:21] nuria: not sure if you've had a look at it recently. UA has a lot of relatively unique information, even on mobile.
[18:05:40] leila: right, that is why we purge it after 90 days
[18:06:03] nuria: I have to look and see if the 10% is for the US or the 5%. nevertheless: the unmatched percentage includes all requests that are part of the study on both platforms (desktop and mobile).
[18:06:13] nuria: yup. purging of course makes sense.
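dsaez's anchor-points description above is the standard offline bilingual embedding alignment, which is also the subject of the arXiv:1702.03859 paper leila links: given word vectors for known translation pairs in two languages, learn a linear (typically orthogonal) map from one space onto the other. Sketched as a math block, under the usual Procrustes formulation (X and Y are notation introduced here):

```latex
% Offline bilingual embedding alignment (orthogonal Procrustes).
% X, Y \in \mathbb{R}^{n \times d}: rows are the embeddings of the n anchor
% (pivot) word pairs in the source and target languages, respectively.
\begin{align*}
  W^{*} &= \operatorname*{arg\,min}_{W^\top W = I} \; \lVert X W - Y \rVert_F^2, \\
  U \Sigma V^\top &= \operatorname{svd}\!\left( X^\top Y \right), \qquad
  W^{*} = U V^\top .
\end{align*}
```

Any source-language vector x is then mapped into the target space as xW*, and nearest-neighbour search there yields candidate translations; constraining W to be orthogonal is what keeps distances and dot products meaningful after the map.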
[18:07:15] nuria: and we have stopped using the extra two features since the gain with them is not huge, but if you throw in browser language and referrer information (and build a session effectively), you can catch even more matches that you would otherwise miss.
[18:13:00] leila: i see, ok. going with the 10% figure, I think the effect you will see from people switching IPs might be as significant as the turnover of sessions due to people closing their browsers [citation needed]. I think that if sampling across a session is an option (cc miriam and bmansurov), that is a lot more convenient than trying to compose signatures. Especially, I think, because - in the case of this experiment - you are going to have to filter a lot of bot traffic not identified as such, which might be distributed across several IPs. This latter problem was not an issue for the surveys, as that experiment was not affected by bot data.
[18:14:07] nuria: I'm fully with you that reconstructing sessions and devices is hard and noisy, and that's part of the reason you saw the 100% sampling. ;)
[18:18:21] nuria: if session ids expire when the browser cache is cleared, as we understand, they are enough for us as unique identifiers.
[18:19:24] nuria: bmansurov pointed out before that in https://doc.wikimedia.org/EventLogging/master/js/source/subscriber.html#mw-eventLog-method-randomTokenMatch we're slicing the first 8 characters of the 20-character token. Wouldn't that introduce a bias?
[18:22:28] miriam: the session will be renewed if you totally close your browser session too, as in kill the process, not just closing a tab. hopefully this makes sense. let me look at your 2nd question
[18:25:32] nuria: thanks
[18:25:35] Dario just shared https://www.scienceeurope.org/coalition-s/ and https://plus.google.com/+PeterSuber/posts/iGEFpdYY9dr in an internal thread. I realize it's shared publicly on Twitter already, but for those of you like me who are not on Twitter, you may find the links helpful.
[18:25:49] miriam: i see. JavaScript cannot deal with arbitrarily large numbers, not even 2^64. so while the random identifier lives in a space of 2^80, it is truly a "string", not an "integer"; the mod needs to happen in a numerical context that JavaScript can understand, so it needs to be coerced to <2^53 or something like that... which, mmmmm, makes me think, now that tokens are 80 bits, that that method needs a fix (cc bmansurov)
[18:25:56] miriam: let me know if this makes sense
[18:28:20] nuria: makes sense! do you think this might cause 2 sessions to be mapped to the same id?
[18:30:13] nuria: I need to process all this information and report it to the researchers in the project -- is it possible to stop the data collection for now, until we've fixed the problem? the event that is generating so many entries is the pageload event. I'd like to understand from them whether they prefer to sample 30-50% of sessions, or drop the pageload event and get back to our previous rate of 150 events/second
[18:30:58] nuria, bmansurov, leila, btw, thanks a lot :)
[18:30:58] miriam: it mathematically means that your session space, when doing sampling, is coerced from 2^80 to 2^53. given that our number of devices in a wiki in a month is less than 2^53, i do not think it is an issue.
[18:31:07] cc bmansurov
[18:31:16] nuria: perfect, thanks :)
[18:39:08] nuria, bmansurov: I need to leave for now. please stop the data collection if you have a chance today. I'll talk with the researchers and get back to you with one of 2 options: 1) sampling at 50% with the current schema (~1000 events/sec), or 2) sampling at 100% without the pageload event (~150 events/sec)
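nuria's point about numeric coercion can be made concrete: parsing the full 20-hex-char token as one number silently loses precision, because 2^80 far exceeds Number.MAX_SAFE_INTEGER (2^53 - 1), while slicing to 8 hex chars keeps the arithmetic exact at the cost of shrinking the key space to 2^32, which matches miriam's correction just below. A small sketch (illustrative, not the EventLogging source):

```javascript
const token = 'ffffffffffffffffffff'; // 20 hex chars = 80 bits, as a string

// Naive: parse the whole token. 2^80 - 1 cannot be represented exactly as a
// double, so any mod computed on it operates on a rounded value.
const whole = parseInt(token, 16);
console.log(whole > Number.MAX_SAFE_INTEGER); // true -> precision already lost

// Safe: use only the first 8 hex chars, an exact 32-bit integer.
const head = parseInt(token.slice(0, 8), 16); // 0xffffffff = 4294967295
console.log(head <= Number.MAX_SAFE_INTEGER); // true -> the mod is exact

// The trade-off: the sampling key space shrinks from 2^80 to 2^32.
// Bucket collisions are harmless for sampling; only the tiny modulo bias
// remains (bounded by d / 2^32 for denominator d).
const bucket = head % 100; // e.g. a 1-in-100 sampling decision
console.log(bucket);
```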
[18:39:29] miriam: correction, the number is not 2^53 but rather 2^32; nuria needs to correct the code that buckets sampling
[18:40:37] nuria: ok, thanks!
[19:23:09] leila: I think I've found the problem with the translator
[19:53:04] 10Quarry, 10Cloud-Services, 10Community-Wikimetrics, 10DBA, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625 (10Dzahn) T202588 exists for the quarry migration. that will unblock a lot of this. Also T162070 is a duplicate of this ticket in a way.
[20:20:21] dsaez: which translator?
[20:20:32] my algorithm
[20:20:36] dsaez: oh, the fact that the scores would sometimes go down?
[20:20:50] yep, check your email
[20:21:00] I'm signing off now
[20:21:11] dsaez: reading the email now. good night.
[20:25:48] dsaez: for when you're back tomorrow: I responded in the email. it's great to see that the performance is boosted significantly. Gathering a good training set remains one of our top challenges.
[20:26:31] leila: I'm trying to see how we can capture the page title with external quick surveys. I heard you were able to capture titles in some previous survey. Any links?
[20:28:51] bmansurov: isn't it part of the schema already?
[20:29:15] bmansurov: https://meta.wikimedia.org/wiki/Schema:QuickSurveysResponses
[20:29:20] leila: thanks
[20:29:30] np
[20:46:04] a quick note that Elijah Mayfield and Alan Black from CMU are working on a project about the immediate and delayed effects of policy-driven decision making in online communities. They'd like to start with Wikipedia as the first example they look at, and they're looking for feedback. I suggested to Elijah to open a page on meta and seek feedback through that page. I want to make sure they don't get blocked on me (and the best kno
[23:44:09] quiddity: any chance you can activate the translation on https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour/Taxonomy_of_Wikipedia_use_cases#Taxonomy_of_Wikipedia_readers ?
[23:44:59] quiddity: ideally I want to restrict it to certain languages at the moment, but I think the control you showed me yesterday would allow the opposite, as in: you could say which languages it shouldn't be translated to
[23:50:42] leila, I can, and the way you want does work. Which languages?
[23:51:22] * leila pulls up the list
[23:51:24] (and should I remove the {draft} template?)
[23:51:48] quiddity: for now, keep it for some more hours. I want Sarah to get a chance to review the rest of the page tomorrow before we remove that. is that okay?
[23:51:56] ofc :)
[23:52:00] great. :)
[23:53:05] quiddity: the languages to have it enabled in are: ar, bn, de, en, es, he, hi, hu, ja, nl, ro, ru, uk, zh
[23:53:22] * leila is super excited. :D
[23:59:02] quiddity: OH! it's there. thanks!
[23:59:09] :)
[23:59:34] ping me if there are any problems, and I can help fix, or disable if need be.