[13:04:25] (CR) Thiemo Mättig (WMDE): [C: 1] "Thanks for the additions and changes!" (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/317172 (https://phabricator.wikimedia.org/T146560) (owner: Ladsgroup)
[13:08:17] (CR) Ladsgroup: Add CacheTest.php (was Extensive CI tests, part III) (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/317172 (https://phabricator.wikimedia.org/T146560) (owner: Ladsgroup)
[13:08:37] (PS6) Ladsgroup: Add CacheTest.php (was Extensive CI tests, part III) [extensions/ORES] - https://gerrit.wikimedia.org/r/317172 (https://phabricator.wikimedia.org/T146560)
[14:24:41] Revision-Scoring-As-A-Service, revscoring: Implement sentences datascources & experiment with normalization. - https://phabricator.wikimedia.org/T148867#2738048 (Halfak) a:Halfak
[14:31:17] Attention! Today is the last day to apply for travel sponsorship for the Wikimedia developer summit!
[14:35:31] Apply here: https://docs.google.com/forms/d/e/1FAIpQLSeIKSjrLqYlDsaIlUmEj4Gf7UgC9J0YtgvjByHOxTM1mRmfUQ/viewform
[14:54:02] Revision-Scoring-As-A-Service: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738206 (Halfak)
[14:54:25] Revision-Scoring-As-A-Service, ORES: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738220 (Halfak)
[14:54:35] Revision-Scoring-As-A-Service, ORES: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738206 (Halfak)
[14:55:40] Revision-Scoring-As-A-Service, ORES: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738206 (Halfak) @awight I've been thinking about what will appear at https://mediawiki.org/wiki/ORES.
Why would we put anything there rather than just having it in the
[15:12:40] Revision-Scoring-As-A-Service, ORES: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738327 (awight) I'm sure that would be fine. The wiki page would potentially enable translation, but that's not a problem we have at this point. Subpages and embedded images in the read...
[15:15:21] Revision-Scoring-As-A-Service, ORES: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738342 (Halfak) I wonder if we could do a good mixture. So, we have: * https://mediawiki.org/wiki/ORES (Long-form discussion of architectural strategy with diagrams) * https://github.co...
[15:19:41] Revision-Scoring-As-A-Service, ORES: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738362 (Halfak) I just extended https://etherpad.wikimedia.org/p/ORES_docs_split to include these two additional documentation spaces.
[15:31:38] Revision-Scoring-As-A-Service, ORES: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738206 (Ladsgroup) We can use .wiki files in github. Let me grab an example
[15:39:06] Revision-Scoring-As-A-Service, ORES: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738428 (Ladsgroup) https://github.com/wmde/WikibaseDataModel/blob/master/docs/foreign-entity-ids.wiki
[16:17:25] Revision-Scoring-As-A-Service, ORES: Create a tools project for hosting ORES datasets (in a labsDB database) - https://phabricator.wikimedia.org/T146722#2738621 (Halfak) I talked to @yuvipanda and @chasemp about this. We came to the conclusion that there's no clear recommendation. So I wrote up an ethe...
[16:35:00] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, User-Ladsgroup: Visually report damaging confidence - https://phabricator.wikimedia.org/T144922#2738710 (Halfak) Note, we talked about this in the revscoring meeting and we determined that three thresholds should be surfaced through a config v...
[16:36:54] Revision-Scoring-As-A-Service, AbuseFilter, MediaWiki-extensions-ORES, User-Ladsgroup: [Spike] Investigate building a hook for abuse filter - https://phabricator.wikimedia.org/T123178#1922959 (Halfak) a:Ladsgroup>None
[16:37:35] Revision-Scoring-As-A-Service, AbuseFilter, MediaWiki-extensions-ORES, User-Ladsgroup: [Spike] Investigate building a hook for abuse filter - https://phabricator.wikimedia.org/T123178#1922959 (Halfak) @He7d3r, would you like to take on this task?
[16:38:31] Revision-Scoring-As-A-Service, Wikilabels, User-Ladsgroup: Revision not found error unformatted and not localized - https://phabricator.wikimedia.org/T139587#2738726 (Halfak) https://github.com/wiki-ai/wikilabels/pull/136
[16:39:25] Revision-Scoring-As-A-Service, rsaas-articlequality: Update wikiclass for revscoring 1.3.0 - https://phabricator.wikimedia.org/T147201#2738728 (Halfak) Title now changed. Everything is ready for review. See https://github.com/wiki-ai/wikiclass/pull/30
[16:48:29] Revision-Scoring-As-A-Service, ORES, Patch-For-Review, User-Ladsgroup: Move from mediawiki/services/ores/deploy to research/ores/deploy or research/ores/deploy-prod - https://phabricator.wikimedia.org/T139008#2738791 (Halfak) @mmodell, can you take a look? Should mirroring be working?
[16:50:32] halfak: Hey, I wanted to check whether this docs thing is a good task to nibble at, or if there are still loose ends wrt the high load spikes. Was that really dkinzler's bot, or is it unknown?
[16:55:27] Revision-Scoring-As-A-Service, ORES: Clean up ORES service documentation - https://phabricator.wikimedia.org/T148974#2738802 (Halfak) @Ladsgroup and I were looking at the etherpad and we realized that mediawiki.org is in a weird spot because the extension/beta feature directs users to the MediaWiki.org p...
[16:59:45] Revision-Scoring-As-A-Service: What's going on with ORES logs? - https://phabricator.wikimedia.org/T148436#2738819 (Halfak) @akosiaris, do you know what's going on here and how we might fix it?
[17:00:39] Revision-Scoring-As-A-Service: What's going on with ORES logs? - https://phabricator.wikimedia.org/T148436#2738823 (akosiaris) That looks like logrotate mess. I'll have a look
[17:01:08] Revision-Scoring-As-A-Service, ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2738825 (Halfak) It looks like we just doubled capacity. And we've filed a task for {T148594} I think that we're done with this [Discuss]ion for now.
[17:02:41] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES: hidenondamaging=1 on Special:Contributions fails to filter out Flow contributions - https://phabricator.wikimedia.org/T146851#2672810 (Halfak) >>! In T146851#2679585, @Ladsgroup wrote: > Does your gerrit change fix this? @Catrope ?
[17:03:07] Revision-Scoring-As-A-Service, ORES: Create a tools project for hosting ORES datasets (in a labsDB database) - https://phabricator.wikimedia.org/T146722#2738831 (Halfak) Open>declined
[17:28:15] awight, I think the spike is unknown.
[17:28:50] It would be good to have a way to track and query user-agents so that we can shut that kind of thing down
[17:28:55] kk, I might poke at that too, then.
[17:29:21] Amir1 is going to look into the prioritized queue strategy
[17:29:30] He did take some notes on some work already
[17:29:56] I'm going to run out to grab lunch.
I'll be AFK for 45 mins
[17:31:08] Any thoughts (after lunch ;) about generally throttling users?
[17:34:17] huh, there aren't any client throttling modules for celery jumping out at me.
[17:48:23] Yeah, I was afk
[17:52:38] Amir1: Hi! I was just rambling about generic throttling for API users
[17:52:50] Donno if there's any way to do that without authentication, though.
[17:53:42] But it sounds like the prioritized queue idea will solve the problem, so non-whitelisted clients can only make requests at some reasonable rate?
[17:53:43] I think (very naively) the design in mind for celery is to have some kind of throttling by using queues
[17:54:35] So building other stuff would complicate things, is it worth it or not. I don't know that part
[18:02:13] sounds perfect--I think.
[18:03:27] So you'll assign high-priority clients to a specific queue?
[18:16:50] adamwight, for throttling, I think we should be looking at uwsgi more than celery
[18:17:36] But yeah, I like the queue-based strategy with celery for prioritizing.
[18:20:16] I'm unclear on how that works--so maybe the idea is that e.g. half the worker capacity is reserved for the high-priority queue?
[18:20:32] If it's figured out already, don't mind me...
[18:21:17] I see http://docs.celeryproject.org/en/latest/faq.html#does-celery-support-task-priorities
[18:21:39] that seems like a waste of resources, though
[18:23:15] adamwight, more that all of the capacity goes to all of the requests until our backlog of requests starts to pile up (just a little bit)
[18:23:49] Then, only people who include a user-agent that can be tracked are allowed to have their requests fulfilled -- until the queue gets moderately full.
[18:23:59] Then only our precachers and MediaWiki make it through.
[18:24:13] So, we'll always make full use of resources
[18:24:19] Until we become over-capacity
[18:24:27] So it's really not like a priority queue.
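Editor's note: the backlog-based strategy described just above (everyone is served until the backlog grows, then only trackable user-agents, then only the precachers and MediaWiki) can be sketched roughly as below. This is a minimal illustration, not ORES code: the tier names, the `admit` function, and both threshold values are hypothetical and chosen only to make the example concrete.

```python
from enum import IntEnum


class ClientTier(IntEnum):
    """Hypothetical client classes, ordered from least to most trusted."""
    ANONYMOUS = 0    # no user-agent that can be tracked
    IDENTIFIED = 1   # user-agent that can be tracked
    INTERNAL = 2     # precachers and MediaWiki


# Assumed backlog sizes (queued requests) at which each restriction starts.
IDENTIFIED_ONLY_AT = 50
INTERNAL_ONLY_AT = 100


def admit(tier: ClientTier, backlog: int) -> bool:
    """Return True if a request from `tier` is accepted at this backlog."""
    if backlog >= INTERNAL_ONLY_AT:
        return tier >= ClientTier.INTERNAL
    if backlog >= IDENTIFIED_ONLY_AT:
        return tier >= ClientTier.IDENTIFIED
    # Below the first threshold it is plain first come, first served.
    return True
```

Because nothing is refused while the backlog is below the first threshold, no worker capacity is held in reserve; the tiers only matter once the service is already saturated.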
[18:24:37] It's first come, first serve until shit gets crazy
[18:25:18] That sounds like we're still reserving capacity...
[18:26:13] With what you described, you're setting a threshold on how much of our capacity can be used by anons, and the remaining 25% or whatever will be underutilized
[18:26:28] This strategy also seems to punish all anons when one user is going bonkers
[18:26:40] adamwight, when requests back up, we're at 100% capacity.
[18:26:45] ok i see
[18:26:49] The queue is almost always at zero
[18:29:08] Once our backlog hits the first (lower) prioritizing threshold, do we stop serving anons entirely?
[18:30:48] adamwight, yeah. Or rather, we'll start handing out 503s until things get manageable again.
[18:31:01] hum. cool
[18:31:07] Anons == People who make requests without an email address in the user agent
[18:31:30] I've never dealt with this kind of loady engineering, it's interesting though.
[18:31:35] right
[18:32:09] It would be nice if we could enforce some rules too. E.g. no more than 4 parallel requests from hosts that are not our own.
[18:33:20] yah instinctually I'd like to just have a reasonable rate limit per client and dial that down under high load, but I don't think there's an easy way to do it, other than use IP as a proxy for unique client.
[18:33:31] gotta walk the dog for a few...
[18:36:30] awight|doggie, when you get back, I'm wondering if you know how we do the IP-based rate limiting for other services.
[18:37:27] * halfak asks apergos in -research.
[18:43:24] Revision-Scoring-As-A-Service, ORES: Implement parallel connection limit for querying ORES - https://phabricator.wikimedia.org/T148997#2739205 (Halfak)
[18:43:47] awight|doggie, for when you get back, some notes from talking to apergos: https://phabricator.wikimedia.org/T148997
[18:47:18] Can you remind me why we aren't using varnish btw?
[18:47:20] Do you know if the recent changes get many hits?
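Editor's note: the rule floated above ("no more than 4 parallel requests from hosts that are not our own", with the IP standing in for a unique client) could look something like the following in-process sketch. The `ParallelLimiter` class and its method names are hypothetical, and the approach finally taken in T148997 may differ; a real deployment with several web worker processes would also need shared state, which this example does not attempt.

```python
import threading
from collections import defaultdict


class ParallelLimiter:
    """Cap concurrent in-flight requests per client IP (illustrative only)."""

    def __init__(self, max_parallel=4, exempt=()):
        self.max_parallel = max_parallel
        self.exempt = frozenset(exempt)   # our own hosts are never limited
        self._active = defaultdict(int)   # ip -> requests currently in flight
        self._lock = threading.Lock()

    def try_acquire(self, ip):
        """Return True if the request may proceed, False if it is refused."""
        if ip in self.exempt:
            return True
        with self._lock:
            if self._active[ip] >= self.max_parallel:
                return False
            self._active[ip] += 1
            return True

    def release(self, ip):
        """Call once a previously admitted request has finished."""
        if ip in self.exempt:
            return
        with self._lock:
            self._active[ip] -= 1
            if self._active[ip] <= 0:
                del self._active[ip]      # keep the table from growing forever
```

A request handler would call `try_acquire` on entry, answer with an error (for instance the 503 mentioned above) when it returns False, and call `release` in a `finally` block otherwise.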
[18:52:58] varnish caching seems like it would be even lower latency, though probably wouldn't eliminate any complexity in the scoring workers
[18:53:51] I'm confused about latency--you were saying something like 20ms, but I saw mostly TTFB > 1s in the web_request table
[18:57:24] adamwight, we do use varnish.
[18:57:40] We don't use varnish caching because we end up having invalidation problems
[18:58:04] So when we update a model, how do we clear the cache in varnish for just that model?
[18:58:16] If we could just clear it when any model was updated, I'd be cool with that.
[18:59:32] right, I was thinking that too
[19:01:33] I believe that akosiaris advised that clearing the cache on each deploy for just ores.wikimedia.org would not be nice and easy.
[19:06:07] I can believe it
[19:06:07] Turns out there is a way, http://twigstechtips.blogspot.com/2014/04/varnish-enabling-wildcard-purging-of.html
[19:06:07] but I donno if it's expensive
[19:12:38] Amir1, please review https://etherpad.wikimedia.org/p/ores_weekly_update
[19:14:31] Revision-Scoring-As-A-Service-Backlog, ORES: Implement selectivg purging of model scores in varnish - https://phabricator.wikimedia.org/T148999#2739288 (Halfak)
[19:14:38] Revision-Scoring-As-A-Service-Backlog, ORES: Implement selective purging of model scores in varnish - https://phabricator.wikimedia.org/T148999#2739301 (Halfak)
[19:14:41] adamwight, https://phabricator.wikimedia.org/T148999
[19:14:55] I put it on the backlog so we can get to it if we get to it.
[19:14:57] cool. Wow, your TTFB for a cache hit is 1ms. We don't have a latency problem for hits, although varnish would take load off of the workers
[19:15:21] I'm poking through logs during the T148347 spike fwiw, 2016-10-17 06:*:*
[19:15:22] T148347: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347
[19:16:02] \o/ SPEEEED
[19:16:19] interesting--why would we be seeing cache hits for WMF-initiated precache requests?
[19:17:02] wait--and web_request.cache_status must be reporting something about headers we send from the app server? Cos there is no frontend web caching?
[19:17:44] Cache miss TTFB is awful during that hour, as you know. lots of 16 second delays
[19:18:35] We timeout at 15 seconds.
[19:18:43] ooh
[19:19:06] For WMF-initiated precaching, we shouldn't see cache hits.
[19:19:16] That's what I'd have thought
[19:19:26] lots of hits
[19:19:39] I have a file on stat1002 if you want to grab it
[19:19:59] Can't now :(
[19:20:03] ~awight/T148347-2016101706.tsv
[19:20:05] no rush!
[19:21:48] halfak: I was afk for some stuff
[19:21:51] I do it right now
[19:21:55] Great! Thank you
[19:23:05] halfak: There are some stuff that we did but they got resolved
[19:23:12] I forgot to tell you about them
[19:23:33] This one https://phabricator.wikimedia.org/T147734 ?
[19:23:41] yes
[19:24:56] OK. I added it.
[19:24:57] Please review the statement on line 15
[19:25:22] One last thing I forgot, We should talk about our doubling capacity
[19:27:16] Do we have a card for that?
[19:27:19] I guess not
[19:27:51] https://phabricator.wikimedia.org/T148380
[19:28:04] Maybe https://phabricator.wikimedia.org/T147903 is what I want
[19:28:47] halfak: https://phabricator.wikimedia.org/T147903
[19:28:56] :P
[19:30:38] OK anything else before I send it out?
[19:30:41] Amir1, ^
[19:30:44] Awesome
[19:30:46] great
[19:30:47] thanks
[19:32:47] sent!
[19:32:52] PREPARE FOR SPAM STREAM
[19:33:25] Revision-Scoring-As-A-Service, ORES: [Discuss] DOS attacks on ORES. What to do?
- https://phabricator.wikimedia.org/T148347#2739360 (Halfak) Open>Resolved a:Halfak
[19:33:27] _o/
[19:33:27] Revision-Scoring-As-A-Service, ORES, Patch-For-Review, User-Ladsgroup: Send celery logs to /srv/log/ores instead of /var/lib/daemon.log - https://phabricator.wikimedia.org/T147898#2739362 (Halfak) Open>Resolved
[19:33:29] Revision-Scoring-As-A-Service, rsaas-articlequality: Update wikiclass for revscoring 1.3.0 - https://phabricator.wikimedia.org/T147201#2739363 (Halfak) Open>Resolved
[19:33:32] Revision-Scoring-As-A-Service, rsaas-articlequality: [Discuss] Hosting the monthly article quality dataset on labsDB - https://phabricator.wikimedia.org/T146718#2739364 (Halfak) Open>Resolved
[19:33:33] _o_
[19:33:33] Revision-Scoring-As-A-Service, ORES: Investigate memory leak in precached - https://phabricator.wikimedia.org/T146500#2739365 (Halfak) Open>Resolved
[19:33:35] Revision-Scoring-As-A-Service, Wikilabels, User-Ladsgroup: Revision not found error unformatted and not localized - https://phabricator.wikimedia.org/T139587#2739368 (Halfak) Open>Resolved
[19:33:37] \o/
[19:33:37] Revision-Scoring-As-A-Service-Backlog, MediaWiki-extensions-ORES: Request scores when someone checks out edits that are not stored in ores_classification - https://phabricator.wikimedia.org/T143612#2739369 (Halfak)
[19:33:40] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, Patch-For-Review, User-Ladsgroup, WMF-deploy-2016-10-25_(1.28.0-wmf.23): Embed machine readable ores scores as data on pages where ORES scores things - https://phabricator.wikimedia.org/T143611#2739366 (Halfak) Open>Resolved
[19:38:23] Revision-Scoring-As-A-Service-Backlog, ORES: Implement selective purging of model scores in varnish - https://phabricator.wikimedia.org/T148999#2739288 (awight) Here's another resource for understanding Varnish bans:
https://www.smashingmagazine.com/2014/04/cache-invalidation-strategies-with-varnish-cache/...
[19:53:24] * adamwight dons shade 10 goggles
[20:00:27] halfak: I have two things to deploy
[20:00:34] https://gerrit.wikimedia.org/r/#/c/316048/
[20:00:50] https://gerrit.wikimedia.org/r/#/c/317326/
[20:01:24] is it okay to deploy?
[20:01:59] Revision-Scoring-As-A-Service, ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2739458 (awight) I'm still trying to understand the cause of the spike, and would like help interpreting a few things. ``` select * from wmf.webrequest where uri_host = 'ores....
[20:03:23] Amir1 a quick question for you
[20:03:29] what is the family for wikimanias
[20:03:35] is it no wm2012 or something?
[20:03:38] for pywiki
[20:03:55] ToAruShiroiNeko: I'm not sure, let me grep them
[20:04:13] I was looking at family.py and my info is wrong :/
[20:04:23] or maybe the language isnt wm2012
[20:04:43] ToAruShiroiNeko: there's no pwb support on wikimania wikis
[20:04:45] I cant seem to process chapters either
[20:04:58] you need to build family files using generate_family_file.py
[20:05:09] Amir1 oh
[20:15:40] Amir1, sorry was AFK. Not ready for an ORES deploy yet
[20:15:54] Oh wait.
[20:16:01] Yeah. Those are OK to deploy
[20:16:53] Revision-Scoring-As-A-Service, ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2739516 (awight) Here's the period of poor performance: https://grafana.wikimedia.org/dashboard/db/ores?from=1476617719415&to=1476734773459&panelId=15&fullscreen
[20:18:43] okay halfak, we deploy after parsoid deploy
[20:18:50] kk
[20:37:49] Can anyone explain the difference between "total scoring requests (including cache)" vs "scoring requests"?
[20:38:26] I was imagining that it meant * total requests received, including those served from cache, vs * uncached scoring requests processed
[20:38:39] Revision-Scoring-As-A-Service, ORES, User-Ladsgroup: Send ORES logs to logstash - https://phabricator.wikimedia.org/T149010#2739588 (Ladsgroup)
[20:38:59] however, that doesn't match the "cache hit rate" graph.
[20:42:27] Revision-Scoring-As-A-Service, ORES: [Discuss] DOS attacks on ORES. What to do? - https://phabricator.wikimedia.org/T148347#2739639 (awight) I'm suspicious of the "burstiness" explanation for this episode. The trouble starts around 2016-10-16 18:45, when there is a flurry of timeout errors, not caused...
[21:01:42] Revision-Scoring-As-A-Service: Review ORES Grafana metrics - https://phabricator.wikimedia.org/T149015#2739711 (awight)
[21:05:45] Revision-Scoring-As-A-Service: Review ORES Grafana metrics - https://phabricator.wikimedia.org/T149015#2739765 (awight)
[21:06:51] adamwight, I've got one more meeting and then I'll take a look with you
[21:07:13] I'll probably be asynchronous today & mostly trying to make sense for my own sanity
[21:07:19] But do ping me when you get to it!
[21:11:10] Revision-Scoring-As-A-Service: Review ORES Grafana metrics - https://phabricator.wikimedia.org/T149015#2739829 (awight)
[21:17:46] Revision-Scoring-As-A-Service: Review ORES Grafana metrics - https://phabricator.wikimedia.org/T149015#2739893 (awight) Documenting things I find: https://www.mediawiki.org/wiki/ORES/Metrics
[21:21:31] gtg
[22:52:57] OK. Just got done
[22:53:02] I'm talking to the scrollback.
[22:53:09] So... let's look at cache hit rates.
[22:54:34] So! We have a few things that are funny.
[22:54:58] Some types of requests skip the cache (e.g.
if you request that the set of features be returned to you with "?features")
[22:55:13] Those are super duper rare, but they don't get counted towards "misses" and "hits"
[22:55:16] See https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/scoring_system.py#L256
[22:55:25] For where "misses" and "hits" get counted.
[22:55:46] Here's where we check if "?features" was requested: https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/scoring_system.py#L66
[22:57:10] So, for all requests regardless of what was requested, we record a "score_request" or a "precache_request" depending on whether "?precache" was in the request. See https://github.com/wiki-ai/ores/blob/master/ores/scoring_systems/scoring_system.py#L49
[22:57:53] OK. Now I'm reviewing the metrics that adamwight pointed me to
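Editor's note: the counting rules described above explain why hits plus misses can add up to less than the total number of requests. The sketch below restates those rules in a few lines; it is not the code in scoring_system.py. The `record_metrics` function and the `cache_hit` / `cache_miss` counter names are assumptions for illustration, while `score_request` and `precache_request` are the names quoted in the chat.

```python
from collections import Counter


def record_metrics(metrics, precache, include_features, cached):
    """Update `metrics` for one request, following the rules in the chat."""
    # Every request is counted, regardless of what was asked for.
    metrics["precache_request" if precache else "score_request"] += 1

    # "?features" requests skip the cache, so they are neither a hit nor a miss.
    if include_features:
        return

    metrics["cache_hit" if cached else "cache_miss"] += 1


metrics = Counter()
record_metrics(metrics, precache=False, include_features=False, cached=True)
record_metrics(metrics, precache=False, include_features=True, cached=False)
record_metrics(metrics, precache=True, include_features=False, cached=False)
```

After these three calls there are two `score_request` counts and one `precache_request`, but only one hit and one miss, because the "?features" request never touched the cache counters.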