[12:01:33] Hi all, wondering if RevisionStore can be used to fetch page content from a different wiki? Revision::getRevisionText takes a $wiki parameter.
[13:20:14] Probably under certain circumstances, yeah
[13:20:31] Not any random wiki on the internet, presumably though
[13:32:23] Reedy, yeah, I should have been specific, I meant from a wiki in the same cluster.
[14:44:21] How often does a wiki's Special:Statistics data get updated, and what triggers it? I'm digging through the MW source trying to figure this out but no dice so far. I run initSiteStats.php --update hourly against my wikis and am just trying to better understand the statistics system, now that I have collectd's curl_json plugin gathering that data to send to graphite.
[14:45:58] You shouldn't need to do that
[14:46:10] You should only need to run initSiteStats.php when there's drift
[14:46:39] Things like users, pages, images etc should increment when new ones are created
[14:47:37] A few years ago I wrote a script for a colleague that generates CSV data of certain stats that he maintains for business purposes; that's when I learned that they get updated periodically, so I use the PHP script to make sure they're updated before running the data-gathering script.
[14:49:45] So maybe some stats are updated more frequently than others? One example is the recent work I'd done on deleting outdated semantic properties. Now that it runs nightly, even though the deletions are steady, last night's run took about 2 hours cleaning around 4k entries, and the graphs during the cleanup look like stairs: flat for an hour until the script runs and then dropping by a huge amount. So I'm guessing this is because that particular stat, at least (provided by SMW and not part of core MW), isn't updated on every action?
[14:50:03] I'd have nfi about SMW stuff
[14:50:35] Job queue jobs all running?
[14:50:44] Yeah, I know it's relatively niche and the SMW channel is super low traffic, though I did join that mailing list so I can ask deeper questions there.
[14:51:32] abijeet: Worth noting that Revision::getRevisionText is deprecated... Which is probably why you were asking about RevisionStore instead
[14:51:58] Yeah, I monitor the job queue. In fact right now I'm gathering the job queue size two ways via collectd: the dbi plugin doing a direct count() of the job table and the curl_json plugin reading the statistics. The two graphs are very similar, but the dbi one is more precise.
[14:52:58] dbi is more precise, btw, because it checks every 10 seconds vs. 60 for the curl (plugin tunable).
[14:54:24] That said, we can often get huge job queues very quickly. The largest I've seen was recent: one of our wikis had a couple of templates updated and the queue spiked ultimately to about 140k jobs; it took about 5 days for that to clear with jobRunRate set to 0.5
[14:55:15] That was rare, but we often see job queue sizes of tens of thousands when widely-used templates are updated
[14:55:41] Why do you think said templates on the Wikipedia sites etc have warnings about editing them... :D
[14:56:32] I don't edit them; a handful of editors in our community who manage the more complex aspects of our wikis' content are the ones who maintain them.
[14:56:55] Occasionally they need to be tweaked due to game updates or other changes that require fixing or updating templates.
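
Since the RevisionStore question at the top never got a concrete answer in-channel: a minimal sketch of what cross-wiki access can look like in MW 1.34+, going through RevisionStoreFactory instead of the deprecated Revision::getRevisionText. It assumes the target wiki is in the same cluster (reachable through the local load-balancer config); the wiki ID 'frwiki' and the revision ID are placeholders.

    <?php
    // Sketch only: run inside MediaWiki (e.g. a maintenance script), MW 1.34+.
    use MediaWiki\MediaWikiServices;
    use MediaWiki\Revision\SlotRecord;

    $services = MediaWikiServices::getInstance();

    // RevisionStoreFactory hands out a RevisionStore bound to another wiki's
    // DB domain; 'frwiki' stands in for a sibling wiki in the same cluster.
    $store = $services->getRevisionStoreFactory()->getRevisionStore( 'frwiki' );

    // Fetch a revision from that wiki by ID (12345 is a placeholder).
    $revRecord = $store->getRevisionById( 12345 );
    if ( $revRecord ) {
        $content = $revRecord->getContent( SlotRecord::MAIN );
        // For wikitext pages this will be a TextContent.
        $text = $content instanceof TextContent ? $content->getText() : null;
    }

Content-model handling still goes through the local wiki's configuration, so expect rough edges if the two wikis don't share the same extensions and content models.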
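
On the Special:Statistics question: the counters Reedy mentions are incremented in place. Core queues small deltas as a deferred update when pages, edits, users, etc. are created, rather than recounting anything, which is why initSiteStats.php is only needed to correct drift. Roughly along these lines (field names as accepted by SiteStatsUpdate::factory() since 1.31; shown only to illustrate the mechanism, and SMW keeps its own property counters separately):

    <?php
    // Illustrative sketch of core's incremental site_stats updates.
    DeferredUpdates::addUpdate( SiteStatsUpdate::factory( [
        'edits' => 1, // one more edit
        'pages' => 1, // one more page of any kind
    ] ) );

    // initSiteStats.php --update, by contrast, recounts from the real tables
    // and overwrites the site_stats row, making it the drift-correction tool.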
[14:59:03] if you have a relatively big wiki, jobRunRate should be 0, and jobs should be run from the command line via a job runner
[15:00:59] Our wikis are pretty darn big. Improving our PHP-FPM environment, probably splitting it into its own farm off the Apache servers (same with Varnish), and implementing a runner are part of the ideas for my big wiki redesign project this year, now that my 1.30 -> 1.34 upgrades are done.
[15:01:28] https://wikiapiary.com/wiki/Guild_Wars_2_Wiki_(en)
[15:04:05] I migrated the wikis from our on-prem datacenters to AWS a couple of years ago, basically a lift and shift, and then was gone for a while; now that I'm back, this was my first big upgrade while in AWS. The next should be to 1.36 (including SMW, which went from 2.5 to 3.1 this time) in the first quarter of 2021. So before that, I'm taking the opportunity to better leverage AWS features and redesign the wiki servers, etc., with more EC2 instances that are better tuned to app needs and performance.
[15:05:38] Right now Varnish, Apache, and PHP all run together on each of a set of EC2 instances. This is also why I've been looking for a way to split Varnish into its own system, which is why your recommendation of how Wikipedia does it with multicast HTCP purging will be so important.
[15:15:24] out of curiosity, how many varnish servers/services do you have for the wiki?
[15:17:18] Actually only 4, and they're not even that big: r5a.xlarge instance types with 4 vCPUs and 32 GB RAM.
[15:18:12] That is, there are 4 EC2 instances, each running a Varnish/Apache/PHP-FPM stack, behind AWS ALBs (well, 3 of them).
[15:39:16] Vulpix btw fwiw there are 7 wikis running on these servers, each as its own vhost.
[15:41:43] I think using $wgSquidServers should be sufficient for them. multicast HTCP would be needed if you had a larger list of Varnish servers to purge, or ones with varying IPs
[15:52:17] I do use $wgSquidServers now, with the IP addresses of all 4 EC2 instances. If one were to need to change, however, I'd have to go through my workflow for updating the config in Salt, getting it to my MW "build server", and then deploying that to the Salt servers. Having an unchanging name/address that can go in that variable would be much simpler and cleaner.
[15:53:11] Plus there's an inefficiency in having 4 independent Varnish servers, in that much of the caching is duplicated (hence the need for the purges), and it's potentially a waste of cache space that may be reducing overall cache hit rates.
[15:54:05] Too bad collectd's varnish plugin is broken in the 5.7 branch available right now in the Ubuntu repos, so I don't currently have Varnish stats going to Graphite for easily monitoring hit rates. Need to address this issue. :/
[17:37:44] justinl: you may be interested in https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF
[17:38:27] for Wikipedia, the varnish setup is also duplicated, but that's done intentionally so that it can handle more parallel load. There is also a second group of varnish behind the first one, where it is hash-distributed instead of round-robin. For that second group, the capacity is the SUM() instead of 1.
[17:38:40] logical storage capacity*
[17:39:58] I'll take a look, thanks! I feel like I might've seen that page or something similar long ago, but it's definitely got a lot of good info and food for thought regarding my redesign.
[17:40:18] Krinkle: On an unrelated note: Thanks so much for all your recent triage and cleanup in Phabricator. Very appreciated.
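
For the job-runner advice at the top of this stretch, the LocalSettings.php side is just turning off execution on web requests; the jobs then have to be worked out-of-band. A sketch with placeholder values, not a tuned recommendation:

    <?php
    // LocalSettings.php: don't piggyback job execution on page views.
    $wgJobRunRate = 0;

    // Then have a dedicated runner host work the queue continuously, e.g. a
    // cron entry or systemd timer along the lines of:
    //   php maintenance/runJobs.php --maxtime 300
    // looped or scheduled every few minutes so the queue never sits idle.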
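
And for the purge discussion, the two approaches look roughly like this with 1.34-era setting names (renamed to $wgUseCdn/$wgCdnServers in 1.35). The IP addresses and multicast group below are placeholders, not working values:

    <?php
    // LocalSettings.php sketch of the two CDN purge options discussed above.
    $wgUseSquid = true;

    // Option 1: list every Varnish instance so MediaWiki sends an HTTP PURGE
    // to each of them (the current setup described above).
    $wgSquidServers = [ '10.0.1.10', '10.0.1.11', '10.0.1.12', '10.0.1.13' ];

    // Option 2: multicast HTCP, the WMF approach, so the cache hosts don't
    // have to be enumerated in the config at all.
    // $wgHTCPRouting = [
    //     '' => [ 'host' => '239.128.0.112', 'port' => 4827 ],
    // ];

One caveat worth checking before committing to the HTCP route on AWS: plain VPC networking doesn't carry multicast without additional setup.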
[17:41:51] :)
[17:42:28] My conundrum is how to keep my systems relatively simple (i.e. standard packages rather than custom builds, as reasonably simple a server/service architecture as I can manage, etc.), since the wikis are big but not Wikipedia-big, yet they do have a bit of complex configuration. So I'm always working to simplify and streamline systems and workflows while maintaining fast and efficient wikis.
[17:44:09] justinl: yeah, rr vs hash for varnish I suppose is a judgement call. if you don't mind me asking, what's the main reason you have 4 as opposed to 1? Is it mainly redundancy/failover? Or do you find in terms of handling incoming req rate that 1 can't handle it?
[17:44:57] HA and lots of traffic: https://wikiapiary.com/wiki/Guild_Wars_2_Wiki_(en)
[17:45:06] that's just one of 7 wikis on the 4 servers
[17:46:00] 2 of them are tiny and for internal use, but the others are GW1, which is big but doesn't have the same amount of traffic that it used to, and then GW2 German, French, and Spanish.
[17:47:48] Four is enough that I can manage rolling server outages (e.g. patching/rebooting): 3 servers can handle the load well enough while one is deregistered from its ALBs for maintenance
[17:48:20] I've even thought about going to 6, just to spread them across 3 AWS availability zones for paranoia's sake.
[17:54:55] So my thought is to have three tiers of servers behind the load balancers: a set of Varnish servers that the ALBs hit; a set of Apache servers that Varnish can use as it sees fit; and a set of PHP-FPM servers for handling all of the MW PHP work, plus for running a MW job queue runner system. So while it's a slightly more complex system, it does simplify certain things and allows each type of server to be tuned best for its application.
[18:00:45] hi everyone, I'm trying to find a way to query wiktionary.org using the API but can't seem to figure out how to select the language of the words returned
[18:01:39] for instance, https://en.wiktionary.org/w/api.php?action=query&format=json&list=random returns a random word. Despite using the English URL, I still see words in other languages appearing.
[18:07:06] I think it just uses any word on the wiki
[18:07:23] https://en.wiktionary.org/wiki/recuir%C3%A9e
[18:07:58] That's because Wiktionary has no proper semantic structure, and hence no API for that, I guess.
[18:08:16] Mm
[18:14:01] Unode: all wiktionaries contain words in all languages
[18:14:10] it's a wonderfully broken system
[18:15:05] there have been some attempts to write a proper API; I don't think any of them are ready for everyday use
[18:17:49] tgr: I figured that would be the case but was hoping to be able to somehow subset that.
[18:18:17] Found a stackoverflow question from a while back asking the same. I guess the answer is no can do :(
[18:18:24] thanks for the feedback everyone
[19:49:02] Unode: basically you are looking for wiktionary entries which are in the category subtree of https://en.wiktionary.org/wiki/Category:English_language
[19:49:27] you can try deepcat search, I doubt it will be useful at that scale though
[19:50:23] the MediaWiki API has no concept of page language (one that would be relevant here, anyway) and very limited support for category-based retrieval
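
One way to approximate the language filtering Unode was after, following tgr's category-subtree pointer: pull entries from an English category via list=categorymembers instead of list=random. A rough sketch; "Category:English lemmas" is an assumption about en.wiktionary's category layout, and this only reads a single category rather than the whole subtree (which is what a deepcat-style search would cover):

    <?php
    // Standalone sketch against the public en.wiktionary API.
    $params = [
        'action'  => 'query',
        'format'  => 'json',
        'list'    => 'categorymembers',
        'cmtitle' => 'Category:English lemmas', // assumed category name
        'cmlimit' => 50,
    ];
    $url = 'https://en.wiktionary.org/w/api.php?' . http_build_query( $params );

    $data = json_decode( file_get_contents( $url ), true );
    foreach ( $data['query']['categorymembers'] as $page ) {
        echo $page['title'], "\n";
    }

It's not random the way list=random is, and a single page can still describe the same spelling in several languages, but it at least constrains the result set to entries with an English section.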