[00:40:23] do we serve originals active-active for reads from ATS-BE to Swift?
[00:40:25] T204245
[00:40:26] T204245: Run MediaWiki media originals active/active - https://phabricator.wikimedia.org/T204245
[00:40:34] This task being open suggests we don't
[00:41:19] I imagine it might reduce intra-DC bandwidth (if not already) if those were routed to the nearest core DC instead of the primary DC.
[00:46:52] Trying to answer my own question: swift-ro and swift-rw don't exist anymore. Which suggests it's just one thing now, at least for ATS->swift::proxy, and I doubt we'd have gone back to serving thumbs from primary only.
[00:47:11] swift.discovery is claimed by swift::proxy, and profile::swift::proxyhosts includes hosts in both DCs
[00:47:55] Krinkle: yes, the swift.discovery.wmnet service is what ATS uses (https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml#405) and indeed that's active/active
[00:49:12] is that defined in DNS with LVS to ms-fe pool?
[00:50:59] it's a combination of state in operations/puppet and in operations/dns. In the former, the service catalog is usually a good place to look: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml#2662
[00:51:54] or, maybe a better way of describing it is that puppet contains the intent, and then operations/dns wires it in differently depending on whether a given service is a/a vs. a/p
[00:52:55] I'm not finding the step between swift.discovery.wmnet and LVS. The dns code uses a template referring to "disc-swift" but I can't seem to find where that's defined.
[00:53:04] OK, I think I found it.
https://gerrit.wikimedia.org/g/operations/puppet/+/bcf2b5d8616b1407f022e140793616f563b31dd6/modules/profile/templates/dns/auth/discovery-geo-resources.erb#4
[00:53:42] failed: https://codesearch.wmcloud.org/search/?q=disc-swift&files=&excludeFiles=&repos=
[00:53:43] success: https://codesearch.wmcloud.org/search/?q=%5Cbdisc-%5Cb&files=pp%7Clvs%7Cservice%7Csvc%7Cswift%7Cdns%7Cpybal&excludeFiles=&repos=
[00:54:31] thx :)
[00:56:32] exactly, yeah: that's where the LVS IPs in the service catalog get wired into resources that are then referenced in gdnsd configuration (along with some dynamic state that depends on etcd) :)
[00:56:54] huh, swift-rw still existed until very recently, ref T376237
[00:56:56] T376237: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237
[00:57:01] noticed your commit in dns.git removing them
[00:57:08] I expected that reverse blame to hit a much older patch
[00:58:13] indeed! they were a persistent source of confusion, since they were not actually in use, so we decided to get rid of them and stick with the a/a swift.discovery.wmnet being canonical
[00:59:08] right, we do have a primary DC for writing to swift from the MW perspective (maybe?) and the reconciliation program, but neither uses swift-rw to guide them
[00:59:29] * Krinkle closes T204245
[01:00:34] * swfrench-wmf thumbs up
[01:01:23] my vague recollection is that we wire the LVS service names into, e.g., ProductionServices.php
[01:01:42] ah, yeah - that seems to be the case (i.e., differentiated by DC)
[01:02:36] right, we write through ms-fe, not directly to ms-be
[01:02:53] I don't think I knew that. I assumed ms-fe was just our swift proxy.
[01:02:57] for generating thumbs on 404
[06:28:01] <_joe_> Krinkle: yeah, that was just our usual terrible phab hygiene; thanks for closing that task :)
[09:22:27] oh my..
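The a/a vs. a/p wiring discussed above can be sketched in miniature. This is an illustrative Python model only, not WMF code: the real mechanism lives in gdnsd (fed by the puppet service catalog and etcd pooled/depooled state), and the names and hostname pattern below are simplified assumptions.

```python
from dataclasses import dataclass

# Toy model of the discovery behaviour discussed above; not the real
# gdnsd/conftool implementation. Names are illustrative.

@dataclass
class Service:
    name: str
    active_active: bool  # a/a serves from both core DCs; a/p from one


def resolve_discovery(svc: Service, client_dc: str, pooled: set) -> str:
    """Pick which DC's LVS VIP `<name>.discovery.wmnet` should answer with.

    `pooled` is the set of DCs currently pooled for this service,
    standing in for the dynamic etcd state gdnsd consults.
    """
    if not pooled:
        raise RuntimeError(f"no pooled DC for {svc.name}")
    if svc.active_active and client_dc in pooled:
        return client_dc          # a/a: clients stay in their own core DC
    return sorted(pooled)[0]      # a/p (or local DC depooled): the pooled DC


def lvs_name(svc: Service, dc: str) -> str:
    """Per-DC LVS service name the discovery record ultimately points at
    (hostname pattern assumed here for illustration)."""
    return f"{svc.name}.svc.{dc}.wmnet"
```

For example, with swift pooled in both core DCs as an a/a service, a client in codfw resolves the discovery name to the codfw side; depooling codfw would send it to eqiad, while an a/p service always resolves to its single pooled DC.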
I'm on clinic duty
[09:22:41] and I just noticed it 😱
[09:23:06] so I'm 1 week behind
[09:23:08] sorry folks :(
[09:27:29] did your IRC client not ping you every 8 hours when the topic changed? :)
[09:32:06] Emperor: for starters I was OoO Thursday and Friday and probably missed the pings, being at the hospital
[09:32:24] and TBH I don't like what you're suggesting
[09:48:04] vgutierrez: sorry, didn't mean to imply anything bad on your part. Just the way my IRC client is set up slightly swamps me in highlights
[10:10:54] I don't know if this might be helpful, but in my case it is. If you go to Splunk On-Call and open your user profile, there's an ICS link under 'Personal Calendar' that you can import into your Google Calendar. It's a bit noisy because it also exports the batphone schedule, but I find it convenient.
[11:21:38] GitLab needs a short maintenance break in one hour
[11:59:43] This afternoon we're going to deploy a patch for Thanos to introduce a dedicated Store Gateway for the Ruler component. This will enable us to backfill blocks related to recording rules while keeping the current cutoff threshold to query fresh data from Prometheus sidecars instead of directly from the object store.
[11:59:52] Moreover, it will shorten the window during which a Thanos host needs to be considered stateful from 15 days to 2 hours, thereby simplifying maintenance operations. It will also protect against certain behaviors we noticed, such as transient gaps in graphs caused by the non-concurrent timing between when the Ruler stops providing data and when the Store Gateway starts.
[11:59:57] We will achieve this through a smooth rollout, disabling Puppet and depooling hosts one by one.
[12:00:13] During the deployment, the Querier component on each Titan host will be able to reach only the Ruler in the same datacenter.
[12:00:17] While this change does not affect Grafana graphs or new blocks generated by the Ruler, it will prevent any inconsistencies generated by the new setup on a given Titan host from propagating to blocks generated by hosts still running the old configuration.
[12:00:28] We'll let you know when we start. If you notice any issues during or after the rollout, please reach out to us in the observability channel. Thanks.
[12:29:42] GitLab maintenance finished
[13:22:47] > We'll let you know when we start
[13:23:37] We're starting the deployment now
[16:20:38] is it already known that https://www.wikimediastatus.net/ is missing data / isn't updating?
[16:20:54] i'm seeing 503s on enwiki too, fwiw
[16:20:58] yeah thanks known
[16:21:14] 👍
[16:24:25] legoktm: https://phabricator.wikimedia.org/T418381 (in addition to other issues)
[16:25:04] ty!
[16:32:47] fyi. Got a notification from speedtest.org that wikipedia.org is down (~60 reports) around Switzerland (loading is kinda slow for me)
[16:32:58] dcaro: thanks, known
[16:33:06] 👍
[16:39:11] (single datapoint) works for me now
[17:36:32] A quick update on the Thanos activities: the initial groundwork has been completed, but it took longer than expected. We'll continue tomorrow. As of now, Thanos is working as it was this morning, except that each Querier is querying only the Ruler in the same DC.
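The cutoff threshold mentioned in the Thanos announcement above splits queries by sample age: fresh data is served by the Prometheus sidecars, older data by the Store Gateway reading the object store, and shrinking the span a host must serve locally (15 days down to 2 hours) is what makes it far less stateful. A minimal sketch of that routing idea, assuming the 2-hour figure applies as the routing cutoff (illustrative only, not actual Thanos code or configuration):

```python
from datetime import timedelta

# Illustrative sketch of the query routing described in the announcement;
# not Thanos code. Fresh samples come from Prometheus sidecars, older ones
# from the Store Gateway via the object store. CUTOFF uses the 2-hour
# window from the announcement (previously hosts held ~15 days locally).

CUTOFF = timedelta(hours=2)


def pick_backend(sample_age: timedelta) -> str:
    """Route a query for a sample of the given age to the right component."""
    return "sidecar" if sample_age < CUTOFF else "store-gateway"
```

Under this model a 30-minute-old sample is answered locally by a sidecar, while anything older than the cutoff is read back from the object store, so a depooled host only has to preserve the last two hours of local state.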