[00:11:57] DBA, AbuseFilter: DBA review of purgeOldLogIPData.php - https://phabricator.wikimedia.org/T186973#3960968 (Huji) p:Triage>High Considering that we have been storing IP data in `abuse_filter_log` for a very long time (i.e. WMF has violated its own retention policy), I am assigning a high priority...
[00:17:16] DBA, AbuseFilter: DBA review of purgeOldLogIPData.php - https://phabricator.wikimedia.org/T186973#3960880 (Huji) Please also see my comment at T186870#3960970 which, if calculated correctly, indicates we will be dealing with an update of hundreds of millions of rows.
[11:37:16] DBA, AbuseFilter: DBA review of purgeOldLogIPData.php - https://phabricator.wikimedia.org/T186973#3961317 (MarcoAurelio) Yeah. I guess my count fell short: ``` MariaDB [enwiki_p]> select count(*) from abuse_filter_log; +----------+ | count(*) | +----------+ | 20402866 | +----------+ 1 row in set (...
[11:39:13] DBA, AbuseFilter: DBA review of purgeOldLogIPData.php - https://phabricator.wikimedia.org/T186973#3961318 (MarcoAurelio) Spanish Wikipedia data: ``` MariaDB [eswiki_p]> select count(*) from abuse_filter_log; +----------+ | count(*) | +----------+ | 8485964 | +----------+ 1 row in set (5.30 sec) MariaD...
[22:00:27] DBA, AbuseFilter: DBA review of purgeOldLogIPData.php - https://phabricator.wikimedia.org/T186973#3960880 (jcrespo) Please set an ORDER BY; a LIMIT without an ORDER BY can lead to different results on masters and replicas; while you can argue that the same thing will be eventually deleted even if the q...
[22:06:05] jynus: thanks for the quick look; hopefully somebody will amend the script as appropriate
[22:07:21] is that an extension?
[22:07:23] is that new?
[22:07:40] do you know who is the maintainer?
[22:08:13] because that doesn't look like a sane functionality to me
[22:08:22] jynus: it's for AbuseFilter, no active maintainers (it's on code stewardship review fwiw)
[22:08:45] AbuseFilter stores an afl_ip field in abuse_filter_log
[22:09:02] while I understand it can be a helpful tool for admins
[22:09:15] afl_ip values for rows older than 90 days must be deleted per our privacy policy
[22:09:22] how it behaves is precisely why tools shouldn't be left unmaintained
[22:09:40] I am going to guess that predates the policy?
[22:09:49] although
[22:09:50] nope, it was likely an oversight
[22:10:07] are those real ips?
[22:10:09] it's been logging data and storing it since 2010+/-
[22:10:12] or account-ips?
[22:10:13] of course
[22:10:28] IP addresses of everyone triggering a filter
[22:10:38] as in, ips for people with accounts (private)
[22:10:51] or ips for people without accounts (public)
[22:10:58] abusefilter-private was disabled so normal users couldn't access the feature
[22:11:21] the IPs of everyone triggering a filter, both unregistered and registered
[22:11:32] so it indeed includes private data
[22:11:37] indeed
[22:12:00] I will ping my manager, probably will have to involve lawyers to give that unbreak now priority
[22:12:17] note that, as DBAs, we normally do not handle content
[22:12:39] just service, it is the responsibility of the developer to take care of the data
[22:12:48] so the point is: checkuser/steward folks would sometimes benefit from knowing which is the originating IP triggering a filter: a new interface was added so that such access is logged but we can't keep data older than 90 days there
[22:12:49] but if there is none
[22:13:07] I will escalate this, as that is a bigger issue
[22:13:18] james alexander is working with us in getting this rolling
[22:13:38] sure, I just think this is a really important issue
[22:13:39] jynus: I can explain further on PM in my mother tongue should you need more info
[22:15:12] the thing was worse ab initio:
abusefilter can provide unlogged access to IP addresses of users triggering any filter
[22:15:14] oh, I see it is already handled
[22:15:33] this is why the 'abusefilter-private' priv is disabled
[22:15:38] do you think it is so bad as to delete those views from labs?
[22:16:00] I think afl_ip from abuse_filter_log does not replicate to labs
[22:16:05] ok
[22:16:08] let me check
[22:16:17] ssh -A maurelio@dev.tools.wmflabs.org
[22:16:20] oops
[22:16:24] I'm dense
[22:16:46] as long as you do not pring here the password :-)
[22:16:55] *print
[22:17:36] oh, I see, it was probably counted via logging
[22:17:48] but the actual stuff shouldn't be available
[22:17:53] MariaDB [metawiki_p]> select * from abuse_filter_log where afl_id=2500; for example
[22:18:02] afl_ip == NULL
[22:18:06] as should be
[22:18:07] yeah
[22:18:19] ok, so, let's fix that
[22:18:22] you can see the IP for example on the 'master' table
[22:18:39] which I think is already being handled
[22:18:39] I can do the same with wikiadmin@... on beta
[22:18:49] I thought it was new
[22:18:58] then maybe we should talk about logging
[22:19:08] the amount of logging happening may be underisable
[22:19:16] not sure why afl_wiki is NULL though
[22:19:18] *undesirable
[22:19:38] there are fields that are set to NULL
[22:19:49] jynus: is it okay that I purge the data on the beta cluster now with foreachwiki ... ?
[22:19:50] some are not immediately apparent
[22:20:08] Hauskatze: I would say we should wait until we have a properly patched script
[22:20:14] perfect
[22:20:17] and use that as a test
[22:20:57] I did https://phabricator.wikimedia.org/T186870#3960746 as a test
[22:21:14] beta eswiki has been closed for years
[22:21:30] but after that I didn't run any more tests pending reviews
[22:25:43] I don't think logging will have any data?
The abuse filter log uses its own table afaics
[22:26:15] ok, then
[22:26:37] I misunderstood, I thought every time that triggers, an entry on logging is added
[22:27:11] yes, every time an abusefilter is triggered, an entry is added to the abuse_filter_log table
[22:27:51] so, to sum up, we want afl_ip data from rows older than 90 days gone (only the afl_ip thing)
[22:28:24] first run I guess it'll be manual, later runs can be scheduled via puppet cron as we do with the checkuser extension
[22:28:26] ok, so maybe abuse_filter_log would need a trim, not for private data, but for clean up?
[22:28:45] (not sure, just suggesting as a followup to research)
[22:29:25] my limited to scarce knowledge can't help answer that question :-)
[22:42:33] DBA, MediaWiki-Parser, Performance-Team, MediaWiki-Platform-Team (MWPT-Q1-Jul-Sep-2017): WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784#3961866 (jcrespo) > reduction in disk usage Logical disk reduction, filesystem level requires defragmentation; ping me (but b...
[22:51:10] DBA, Operations, Performance-Team, Patch-For-Review, codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523#3961878 (jcrespo) See my latest comments on: T167784#3961866 > The third one is a bigger question regarding active-acti...
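Editor's note: the purge summed up in the log (blank afl_ip on rows older than 90 days, and pair every LIMIT with an ORDER BY so masters and replicas touch the same rows) can be sketched roughly as below. This is not the purgeOldLogIPData.php script. It is a minimal illustration against an in-memory SQLite table, under several assumptions: the function name `purge_old_ips`, the batch size, the `afl_timestamp` column in 14-digit MediaWiki format, and blanking to an empty string are all chosen for the example. MariaDB accepts `UPDATE ... ORDER BY ... LIMIT` directly; a default SQLite build does not, hence the subquery.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_old_ips(conn, now, retention_days=90, batch_size=1000):
    """Blank afl_ip on rows older than the retention window, in ordered batches.

    Hypothetical helper: names and defaults are assumptions for this sketch.
    """
    # Assumed timestamp format: YYYYMMDDHHMMSS strings, which sort chronologically.
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y%m%d%H%M%S")
    total = 0
    while True:
        # ORDER BY afl_id makes each batch deterministic: the same rows are
        # selected wherever the statement runs, unlike a bare LIMIT.
        cur = conn.execute(
            """
            UPDATE abuse_filter_log
               SET afl_ip = ''
             WHERE afl_id IN (
                   SELECT afl_id
                     FROM abuse_filter_log
                    WHERE afl_timestamp < ?
                      AND afl_ip <> ''
                    ORDER BY afl_id
                    LIMIT ?)
            """,
            (cutoff, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Tiny demonstration with made-up rows (documentation-range IPs).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE abuse_filter_log ("
    "afl_id INTEGER PRIMARY KEY, afl_timestamp TEXT, afl_ip TEXT)"
)
conn.executemany(
    "INSERT INTO abuse_filter_log VALUES (?, ?, ?)",
    [
        (1, "20100101000000", "192.0.2.1"),
        (2, "20171001000000", "192.0.2.2"),
        (3, "20180201000000", "192.0.2.3"),
    ],
)
now = datetime(2018, 2, 12, tzinfo=timezone.utc)
purged = purge_old_ips(conn, now, batch_size=1)
print(purged)
```

Small batches keep each write short, which matters when, as estimated in the log, hundreds of millions of rows may need updating. A production run would presumably also wait for replication between batches, which this sketch leaves out.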