Wikimedia IRC logs browser - #wikimedia-serviceops

Displaying 210 items:

2020-09-24 08:00:58 <_joe_> effie, jayme is push-notifications not using the service proxy?
2020-09-24 08:01:10 <_joe_> https://phabricator.wikimedia.org/T260247#6489959
2020-09-24 08:01:42 <wikibugs> 'serviceops, ''Push-Notification-Service, ''Patch-For-Review, ''Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (''Joe)'
2020-09-24 08:01:51 <jayme> _joe_: No. The understanding was that it does not connect to internal services
2020-09-24 08:02:25 <wikibugs> 'serviceops, ''Push-Notification-Service, ''Patch-For-Review, ''Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (''Joe) Or even better: it should use the ser...'
2020-09-24 08:02:26 <_joe_> well it does :P
2020-09-24 08:02:33 <jayme> obviously..
2020-09-24 08:02:33 <_joe_> see above
2020-09-24 08:02:35 <effie> I was under the same impression, let us discuss it with michael
2020-09-24 08:03:45 <_joe_> the presence of main_app.mwapi_uri is quite telling imho
2020-09-24 08:46:05 <_joe_> sigh there are still connections to ORES that are not via https
2020-09-24 08:46:11 <_joe_> goes spelunking
2020-09-24 08:50:07 <wikibugs> 'serviceops, ''User-jijiki: Improve the New Service Request documentation and template - https://phabricator.wikimedia.org/T263723 (''jijiki)'
2020-09-24 09:35:37 <_joe_> hnowlan: do you have a task for SLOs for the API gateway?
2020-09-24 09:38:28 <wikibugs> 'serviceops, ''Operations, ''Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (''Joe)'
2020-09-24 09:38:39 <hnowlan> _joe_: yep, T254916
2020-09-24 09:42:58 <wikibugs> 'serviceops, ''Operations, ''observability, ''Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (''Joe)'
2020-09-24 09:44:04 <wikibugs> 'serviceops, ''Operations, ''Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (''Joe)'
2020-09-24 09:51:45 <hnowlan> Any ideas why helmfile is trying to sync changes for both a "production" and "staging" release in a service, regardless of whether it's a prod or a staging service? changeprop-jobqueue in this case
2020-09-24 09:59:44 <jayme> hnowlan: with "sync" it is trying to ensure state. So I think it just ensures that there is no staging release present (e.g. it would remove existing ones from prod environments)
2020-09-24 10:04:22 <hnowlan> jayme: ahh okay, the double !log for each sync made me think there was a misconfig. that makes sense
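A minimal Python sketch of the "ensure state" reading jayme describes above. This is not helmfile's implementation; the function name `plan_sync`, the action labels and the data shapes are all hypothetical. It only illustrates why a sync can touch (and log) a release that is meant to be absent in a given environment.

```python
def plan_sync(releases, installed):
    """Return one planned action per declared release.

    releases:  dict mapping release name -> True if it should exist here
    installed: set of release names currently present in the cluster
    (Both shapes are assumptions made for this illustration.)
    """
    plan = []
    for name, wanted in sorted(releases.items()):
        if wanted:
            # Declared as wanted: converge it to the declared state.
            plan.append((name, "upgrade-or-install"))
        elif name in installed:
            # Declared as unwanted but present: remove it.
            plan.append((name, "delete"))
        else:
            # Declared as unwanted and absent: still visited, nothing to do.
            plan.append((name, "already-absent"))
    return plan


# A prod environment where only the "production" release should exist.
prod_plan = plan_sync({"production": True, "staging": False}, {"production"})
print(prod_plan)
```

Every declared release is visited even when nothing changes, which would be consistent with seeing two log entries per sync.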
2020-09-24 10:13:27 <_joe_> hnowlan: !log is something we still need to improve
2020-09-24 10:13:33 <_joe_> contributions welcome :P
2020-09-24 10:49:44 <_joe_> hnowlan: so... T263727 is the thing we need to do in order to obtain information about the rest api latencies
2020-09-24 11:04:10 <_joe_> hnowlan: ugh I have a few issues with changeprop's config, we need to talk about it later
2020-09-24 11:04:23 <_joe_> but basically, it should switch to use the https uri for ORES
2020-09-24 11:05:03 <_joe_> and also for restbase :)
2020-09-24 11:05:08 <_joe_> I'll send a patch I guess
2020-09-24 11:06:38 <hnowlan> _joe_: oh, thanks for the ticket - I'll have a look
2020-09-24 11:06:46 <hnowlan> interestingly I am debugging issues with changeprop-jobqueue right now
2020-09-24 11:07:15 <_joe_> what issues?
2020-09-24 11:07:26 <_joe_> maybe it's due to something we did
2020-09-24 11:07:35 <_joe_> do you have a task?
2020-09-24 11:08:47 <_joe_> uhmmm although I doubt it. It just calls the jobrunners, and that didn't change
2020-09-24 11:08:51 <hnowlan> not at the moment - it's just the staging instance thankfully, but the instance fails to start workers and crashloops
2020-09-24 11:08:54 <hnowlan> prod instances are fine
2020-09-24 11:09:05 <_joe_> oh I see
2020-09-24 11:09:07 <_joe_> ok
2020-09-24 11:09:19 <_joe_> nice to see staging helps finding issues before they reach prod :)
2020-09-24 11:09:31 <_joe_> anyways, we can talk later, I'm taking a break
2020-09-24 11:09:39 <hnowlan> sounds good
2020-09-24 11:31:23 <hnowlan> oh dear, nothing quite as complex as I suspected, just an OOM kill
2020-09-24 13:36:40 <ottomata> hiya
2020-09-24 13:36:45 <ottomata> how can I delete a single pod?
2020-09-24 13:36:47 <ottomata> i used to do
2020-09-24 13:36:47 <ottomata> https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Delete_a_specific_k8s_pod
2020-09-24 13:37:07 <ottomata> but now
2020-09-24 13:37:07 <ottomata> Error from server (Forbidden): pods "eventstreams-canary-6df87cd67f-8qshm" is forbidden: User "eventstreams" cannot delete resource "pods" in API group "" in the namespace "eventstreams"
2020-09-24 13:40:30 <jayme> ottomata: you need to be "cluster admin" to delete pods (deploy users don't have the right to do so)
2020-09-24 13:41:11 <jayme> ottomata: go for "sudo -i; kube_env admin <CLUSTER>; kubectl -n eventstreams foo bar"
2020-09-24 13:41:23 <ottomata> ah k
2020-09-24 13:51:14 <ottomata> jayme: ....you know what would be really useful?
2020-09-24 13:51:22 <ottomata> some kind of logstash k8s service dashboard
2020-09-24 13:51:29 <ottomata> where you could select the service/app name from a drop down
2020-09-24 13:51:32 <ottomata> and get all logs for your service
2020-09-24 13:51:35 <ottomata> from the app
2020-09-24 13:51:48 <wikibugs> 'serviceops, ''Maps, ''Product-Infrastructure-Team-Backlog: [OSM] Install imposm3 in Maps master - https://phabricator.wikimedia.org/T238753 (''MSantos) >>! In T238753#5678746, @MoritzMuehlenhoff wrote: > Ah, that explains, it was removed from Debian in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=9326...'
2020-09-24 13:51:50 <ottomata> ...we don't have that already, do we?
2020-09-24 13:51:51 <jayme> yeah! so nice of you that you want to create one :D
2020-09-24 13:51:54 <ottomata> haha
2020-09-24 13:51:57 <ottomata> i'll make a ticket
2020-09-24 13:52:02 <ottomata> logstash is really hard
2020-09-24 13:54:31 <wikibugs> 'serviceops, ''Wikimedia-Logstash, ''Kubernetes: Create a logstash dashboard showing all application logs for a selected service - https://phabricator.wikimedia.org/T263755 (''Ottomata)'
2020-09-24 14:00:16 <cdanis> ottomata: mood
2020-09-24 14:06:12 <wikibugs> 'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''Cmjohnson) a:''Cmjohnson''RobH @robh power has been pulled and flea power drained'
2020-09-24 14:33:47 <wikibugs> 'serviceops, ''Wikimedia-Logstash, ''Kubernetes: Create a logstash dashboard showing all application logs for a selected service - https://phabricator.wikimedia.org/T263755 (''Ottomata) I guess this is only possible in logstash-next: https://www.elastic.co/blog/interactive-inputs-on-kibana-dashboards'
2020-09-24 14:38:36 <wikibugs> 'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009241438_robh_29643...'
2020-09-24 14:38:39 <wikibugs> 'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] `'
2020-09-24 14:39:16 <wikibugs> 'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009241439_robh_30270...'
2020-09-24 15:00:00 <wikibugs> 'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] `'
2020-09-24 15:00:13 <wikibugs> 'serviceops, ''Wikimedia-Logstash, ''Kubernetes: Create a logstash dashboard showing all application logs for a selected service - https://phabricator.wikimedia.org/T263755 (''Ottomata) Hm, perhaps: https://logstash-next.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c?_g=(filters%3A...'
2020-09-24 15:22:53 <wikibugs> 'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''RobH) Host is now online (reimaged) and returned to service post scap pull and repool. Set to active in netbox. However, its not in the DSH node groups, and the directions aren't clear on where th...'
2020-09-24 15:27:31 <addshore> o/ hi all, What are the current hosting options for "static sites" that may have a build step in them before they are ready to be "static"?
2020-09-24 15:28:27 <addshore> and also, I remember talking in the past about trying to get things like the query service UI (for wikidata) deployed via blubber etc so that we (wmde) could do our own deployments. Is that still "kind of okay"? or are there more options now?
2020-09-24 15:31:08 <_joe_> addshore: we're all in a meeting sorry
2020-09-24 15:31:13 <addshore> ack! np!
2020-09-24 15:54:03 <wikibugs> 'serviceops, ''Operations, ''Recommendation-API: recommendation-api alerting and api errors - https://phabricator.wikimedia.org/T262587 (''crusnov) p:''Triage''Medium'
2020-09-24 16:00:17 <wikibugs> 'serviceops, ''Wikidata, ''Wikidata-Termbox: Termbox service: unusual errors that could be from envoy - https://phabricator.wikimedia.org/T263764 (''Tarrow)'
2020-09-24 16:02:53 <wikibugs> 'serviceops, ''Operations, ''observability: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (''crusnov)'
2020-09-24 16:11:22 <wikibugs> 'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''RobH) After checkign with @Joe via irc, it seems this should automatically be added back into DSH and clear after the puppet run and repooling, but has not. All other checks green, but I'd like to...'
2020-09-24 16:13:24 <wikibugs> 'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''Volans) It looks it's marked as inactive on conftool: ` $ confctl select 'name=mw1360.eqiad.wmnet' get {"mw1360.eqiad.wmnet": {"weight": 30, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=api_apps...'
2020-09-24 16:30:50 <wikibugs> 'serviceops, ''Operations, ''User-WDoran, ''User-brennen: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (''brennen)'
2020-09-24 16:36:06 <wikibugs> 'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''RobH) ''Open''Resolved Ok, all is now green for the host in icinga and it shows in pooled/in service state.'
2020-09-24 16:46:56 <wikibugs> 'serviceops, ''Operations, ''ops-codfw: decom wtp2005 (was: wtp2005 hardware issue) - https://phabricator.wikimedia.org/T257903 (''Papaul) ''Open''Resolved Disks removed from server and unrack'
2020-09-24 18:20:17 <wikibugs> 'serviceops, ''Operations, ''Product-Infrastructure-Team-Backlog, ''Wikifeeds, and 2 others: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (''Pchelolo)'
2020-09-24 19:47:34 <wikibugs> 'serviceops, ''Operations, ''Parsing-Team, ''Platform Engineering, and 5 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (''daniel)'
2020-09-24 19:47:54 <wikibugs> 'serviceops, ''Operations, ''Parsing-Team, ''TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (''daniel)'
2020-09-24 19:48:40 <wikibugs> 'serviceops, ''Operations, ''Parsing-Team, ''TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (''daniel) a:''holger.knust''None'
2020-09-24 19:52:29 <bd808> Is there a plan to make a buster + python3.7 base image? The current docker-registry.wikimedia.org/python3 is 3.5 + stretch.
2020-09-24 19:53:37 <wikibugs> 'serviceops, ''Operations, ''Parsing-Team, ''TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (''daniel) Looking into this some more, we came across a number of issues, namely: * Diffs and permalinks don...'
2020-09-24 23:56:09 <wikibugs> 'serviceops, ''MW-on-K8s, ''Operations, ''TechCom-RFC, ''Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (''tstarling)'
2020-09-25 05:50:41 <_joe_> bd808: no one did because no one needed it. Patches or tasks welcome :P
2020-09-25 06:32:08 <wikibugs> 'serviceops, ''MW-on-K8s, ''Operations, ''TechCom-RFC, ''Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (''tstarling) I uploaded the Score changes to give you an idea of what a moderately complex caller looks like in practice. It's l...'
2020-09-25 06:53:16 <wikibugs> 'serviceops, ''Operations, ''ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (''elukey) Added a week of downtime, sorry for the powercycle :('
2020-09-25 07:44:44 <addshore> _joe_: ping re those questions that I asked yesterday :) If this isn't the best place to ask / I should target someone specifically let me know!
2020-09-25 07:45:35 <_joe_> addshore: I'm supposed to be off today, so if my partner catches me here, we'll pretend we're talking about football
2020-09-25 07:45:58 <_joe_> but: I would suppose we could just have a service for all kinds of static content
2020-09-25 07:46:35 <addshore> aaah, you're thinking maybe 1 k8s service somehow for a bunch of different sites / content?
2020-09-25 07:46:36 <_joe_> a helm chart for an nginx with some configuration
2020-09-25 07:46:54 <_joe_> yes, it's only static stuff, there is no point in having many
2020-09-25 07:47:32 <addshore> that's an interesting idea, I wonder how that would end up tying together with blubber and all
2020-09-25 07:47:41 <_joe_> now, it's trivial to do I think, but if you do it correctly, we can just make the chart smart enough that we can re-use the same nginx image
2020-09-25 07:47:51 <_joe_> oh I was thinking you don't need blubber at all
2020-09-25 07:48:07 <addshore> thats also fine! (no blubber)
2020-09-25 07:48:09 <_joe_> and you could mount the static content as volumes in k8s
2020-09-25 07:48:25 <_joe_> but it can also be built into the image
2020-09-25 07:48:35 <addshore> Right, and then have a similar sort of process for the static content, build it, put it in some repo somewhere?
2020-09-25 07:48:35 <_joe_> let's hear what others think too, though
2020-09-25 07:49:15 <_joe_> yeah so, either we just abuse blubber, create a repo called wikimedia-static-content and the build is just a nginx image + the content of the repo
2020-09-25 07:49:24 <addshore> I like the idea, my main question then is what would the update process be for content, and would we (wmde) easily be able to do it?
2020-09-25 07:49:28 <_joe_> I keep saying nginx but assume it's any webserver
2020-09-25 07:49:56 <_joe_> so, thinking about it
2020-09-25 07:50:17 <_joe_> probably the pre-built image is the best idea, so using blubber
2020-09-25 07:50:22 <_joe_> to make your life easier
2020-09-25 07:50:55 <_joe_> but! we will need a dedicated subdomain for this I assume?
2020-09-25 07:51:01 <addshore> So, some service called "static content" which has a repo with a blubber file that builds an image with the content in it?
2020-09-25 07:51:15 <_joe_> and the nginx or apache config
2020-09-25 07:51:18 <addshore> ack
2020-09-25 07:52:13 <addshore> So, 1 usecase for this would be the UI currently at query.wikidata.org (but query.wikidata.org/sparql etc still need to point at the wdqs hosts).
2020-09-25 07:52:13 <addshore> A second usecase and site of static content would be something like query.wikidata.org/magicplace
2020-09-25 07:52:53 <addshore> I guess these could get handled by the nginx service running on the wdqs servers then? (could also be done at a different level, but I dont know about that)
2020-09-25 07:52:55 <_joe_> uhhh so you would need to split by url at the edge? that's a big no-no
2020-09-25 07:53:28 <addshore> ^^ yeah, that sounded hard, so I didnt think about that
2020-09-25 07:53:40 <_joe_> tbh, for your specific problem, I would assume having a repo with the static content and just deploy it with scap3 to the wdqs nodes would make more sense
2020-09-25 07:53:59 <addshore> so, feedback from the wdqs folks was that ideally we wouldnt be deploying more stuff there
2020-09-25 07:54:31 <addshore> and ideally we would even remove the UI from there, and also put it somewhere that we (wmde) can have more control over getting updates deployed there
2020-09-25 07:54:31 <_joe_> ok, so here's the thing: your repo will need to contain the webserver config too, not just the static content
2020-09-25 07:54:57 <_joe_> I would expect we'd do internet => varnish => wdqs frontend => wdqs
2020-09-25 07:55:03 <_joe_> just for sparql queries
2020-09-25 07:55:33 <addshore> "wdqs frontend" being this new static frontend bit?
2020-09-25 07:55:35 <_joe_> let's see what jayme and akosiaris think about this too :)
2020-09-25 07:55:37 <_joe_> yes
2020-09-25 07:55:50 <addshore> okay, yeah, this all sounds fairly sane
2020-09-25 07:56:16 <addshore> One thing that jumps to mind then is that wdqs sparql queries depend on this static content service (not sure if that is a good or bad thing)
2020-09-25 07:56:18 <_joe_> also means we can add, in the future, some form of intelligence/filtering to what we throw back to sparql
2020-09-25 07:56:33 <_joe_> well just for the public endpoint
2020-09-25 07:56:43 <addshore> _joe_: haha you mean something like this https://github.com/wmde/queripulator ? ;)
2020-09-25 07:56:59 <_joe_> I already love the name
2020-09-25 07:57:05 <addshore> xD
2020-09-25 07:57:14 <_joe_> although to be deployed in production it will need to follow our naming convention
2020-09-25 07:57:20 <_joe_> so queripulatoroid
2020-09-25 07:57:38 <addshore> amazing, I like the general sound of all of this, I'm going to head off for 15 mins, but would enjoy to continue chatting in this direction.
2020-09-25 07:57:43 <addshore> I might also try and write some of this up
2020-09-25 07:57:43 <_joe_> actually, yes, I can see the now-static site
2020-09-25 07:57:55 <_joe_> add queripulator in the future :)
2020-09-25 07:58:22 <_joe_> addshore: also discuss this with the discovery team folks at WMF who also do some work on WDQS
2020-09-25 07:58:44 <_joe_> I'm not going to be around later sorry, as I said I'm on PTO today ;)
2020-09-25 07:58:54 <addshore> np! go and enjoy your friday off!
2020-09-25 07:59:11 <addshore> yes, I'll write something concise and then go and talk to discovery a little more (I was chatting with them yesterday on this topic too)
2020-09-25 08:15:25 <wikibugs> 'serviceops, ''Operations, ''Traffic: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (''ema)'
2020-09-25 08:27:21 <wikibugs> 'serviceops, ''Operations, ''Traffic: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (''ema)'
2020-09-25 08:36:19 <akosiaris> o/
2020-09-25 08:37:30 <jayme> o/
2020-09-25 08:37:36 <addshore> o/
2020-09-25 09:13:45 <jayme> _joe_: regarding the discussion about static sites in k8s I had something similar with mu.tante in my first month (he has a ton of static content hosted on some VMs as well). The issues arising there were a) (obv.) getting traffic to it without needing an LVS per static site. b) Having some kind of easy pipeline for "not-so-tech-folks" to handle content updates, releases etc.
2020-09-25 09:16:03 <jayme> Maybe it would be smart to come up with something that is reusable for that as well. I was also thinking at some point: Can we serve static stuff from swift? Only hosting some kind of server in k8s? That could maybe remove the burden of building images from the people managing the static content
2020-09-25 09:59:59 <addshore> ^^ that sounds fun
2020-09-25 10:00:45 <addshore> Keeping the content out of the images sounds like a bonus so you done have to rebuild everything for a 1 line fix on one site for example
2020-09-25 10:00:49 <addshore> *dont
2020-09-25 10:05:05 <jayme> akosiaris: I see you're looking into the termbox thing as well. What stands out is that the lowered error rate mentioned in https://phabricator.wikimedia.org/T255410#6488465 looks like a direct result of switching the mw api calls to use envoy (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/622580)
2020-09-25 10:07:34 <jayme> the logstash links do work for me btw. (I get the error messages as well, but the document shown to me looks like the right one)
2020-09-25 10:07:51 <akosiaris> jayme: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=All&var-origin_instance=All&var-destination=termbox might also be useful. I am just starting to look into it, I am still not clear on which direction the errors are
2020-09-25 10:08:05 <akosiaris> remember that it's mediawiki -> termbox -> mediawiki IIRC
2020-09-25 10:09:31 <akosiaris> jayme: yeah, I 've seen it be at times and at times not be what was expected. Hence following the safer approach
2020-09-25 10:09:54 <jayme> ack
2020-09-25 10:11:15 <jayme> [2020-09-24T14:19:17.887Z] "GET /w/index.php?format=json&title=Special:EntityData&id=Q26762157&revision=1138265532 HTTP/1.1" 503 UC 0 95 0 - "-" "wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.18.1" "88f5fde8-87bc-49d8-9804-30a212bcb780" "www.wikidata.org:6500"; "10.2.1.22:443"
2020-09-25 10:12:21 <jayme> doesn't this say 503 UC while connecting "upstream" which in this case is www.wikidata.org:6500 which is localhost:6500 which is api-rw.discovery.wmnet?
2020-09-25 10:13:01 <jayme> and as this is logged from envoy sidecar in termbox it is the termbox -> mediawiki part of the chain?
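To make the fields in the pasted log line easier to read, here is a small Python sketch that splits a line in Envoy's documented default access log format into named fields. The field labels, the function name `parse_envoy_line` and the short flag table are our own choices, not something from the channel. The pattern also tolerates the stray `;` before the last field, on the assumption that it is a paste artifact.

```python
import re

# One regex for Envoy's default access log format; group names are our labels.
ENVOY_DEFAULT = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d+) (?P<flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+) '
    r'(?P<upstream_time>\S+) '
    r'"(?P<forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)";? '
    r'"(?P<upstream_host>[^"]*)"'
)

# A few response flags and their meaning (a subset, not the full list).
RESPONSE_FLAGS = {
    "UC": "upstream connection termination",
    "UF": "upstream connection failure",
    "UT": "upstream request timeout",
}


def parse_envoy_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ENVOY_DEFAULT.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    return entry


# The line pasted in the channel, copied as-is.
SAMPLE = (
    '[2020-09-24T14:19:17.887Z] "GET /w/index.php?format=json&title=Special:EntityData'
    '&id=Q26762157&revision=1138265532 HTTP/1.1" 503 UC 0 95 0 - "-" '
    '"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.18.1" '
    '"88f5fde8-87bc-49d8-9804-30a212bcb780" "www.wikidata.org:6500"; "10.2.1.22:443"'
)

entry = parse_envoy_line(SAMPLE)
print(entry["status"], entry["flags"], RESPONSE_FLAGS.get(entry["flags"]))
print(entry["authority"], "->", entry["upstream_host"])
```

Read this way, the last two fields are the authority the client asked for and the upstream host Envoy selected, which matches jayme's reading that the 503 UC is reported for the upstream side.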
2020-09-25 10:30:28 <jayme> akosiaris: so when I get that right, they had like a ton of timeouts calling https://api-ro.discovery.wmnet/w/index.php until envoy and now they get (way less) UC 503 from envoy instead, no longer seeing the timeouts in termbox itself, right?
2020-09-25 10:31:33 <jayme> One thing is with envoy we route them to the rw-api for some reason (don't know if that makes a difference) and the other thing is the errors seen now could very well have been there all the time hiding in the timeouts
2020-09-25 10:48:18 <akosiaris> could be...
2020-09-25 10:49:55 <tarrow> o/
2020-09-25 10:50:21 <tarrow> any questions about the above feel free to fire them at me
2020-09-25 10:51:51 <tarrow> I think we also *suspect* that the host header should not actually be set to `www.wikidata.org:6500` but instead plain `www.wikidata.org`. Although clearly this is working most of the time
2020-09-25 10:58:16 <akosiaris> that's not the host header. It's the authority HTTP2 header and the port is fine there per https://tools.ietf.org/html/rfc7540#section-8.1.2.3
2020-09-25 11:00:16 <akosiaris> note that per RFC 3986, authority = [ userinfo "@" ] host [ ":" port ] (you need this knowledge too to understand that part of rfc7540)
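The authority grammar quoted above can be checked quickly with Python's standard library. This is only a sketch: the helper name `split_authority` is hypothetical, and it relies on `urllib.parse.urlsplit` treating a string that starts with `//` as a network location.

```python
from urllib.parse import urlsplit


def split_authority(authority):
    """Split an RFC 3986 authority into (userinfo user, host, port)."""
    # Prefixing "//" makes urlsplit parse the string as a netloc.
    parts = urlsplit("//" + authority)
    return parts.username, parts.hostname, parts.port


with_port = split_authority("www.wikidata.org:6500")
without_port = split_authority("www.wikidata.org")
print(with_port)
print(without_port)
```

The port is an optional part of the authority, so `www.wikidata.org:6500` is a well-formed value for that field.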
2020-09-25 11:02:00 <akosiaris> the envoy access logging default format (we haven't yet diverged from it) is in https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage
2020-09-25 11:16:00 <tarrow> So this `GET /w/index.php?format=json&title=Special:EntityData&id=Q26762157&revision=1138265532 HTTP/1.1" 503 UC 0 95 0 - "-" "wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.18.1" "88f5fde8-87bc-49d8-9804-30a212bcb780" "www.wikidata.org:6500"; "10.2.1.22:443"` isn't the host header? even though it's HTTP 1.1 ?
2020-09-25 11:16:46 <tarrow> reads up a little on these http fundamentals
2020-09-25 11:19:07 <akosiaris> tarrow: btw even in HTTP/1.1 the Host header MUST have the port if it differs from the expected one for the protocol https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.23
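The Host header rule akosiaris cites can be sketched as a small Python helper. The function name `host_header` and the `DEFAULT_PORTS` table are hypothetical, and the scheme used in the example for the local listener on port 6500 is an assumption.

```python
# Default ports per scheme; a Host value without a port implies these.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header(scheme, host, port=None):
    """Build a Host header value, adding the port only when it is non-default."""
    if port is None or port == DEFAULT_PORTS.get(scheme):
        return host
    return f"{host}:{port}"


# Assumed scheme "http" for the local listener on port 6500.
via_proxy = host_header("http", "www.wikidata.org", 6500)
direct = host_header("https", "www.wikidata.org", 443)
print(via_proxy)
print(direct)
```

Under that rule, a request sent to a listener on port 6500 is expected to carry the port in its Host value, while a request to the standard HTTPS port can leave it out.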
2020-09-25 11:20:20 <akosiaris> on the move
2020-09-25 11:38:40 <wikibugs> 'serviceops, ''Operations, ''Wikidata, ''Wikidata-Termbox, ''User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (''Addshore)'
2020-09-25 11:46:51 <wikibugs> 'serviceops, ''Operations, ''Wikidata, ''Wikidata-Termbox, ''User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (''Addshore)'
2020-09-25 13:46:18 <jayme> akosiaris: _joe_: I tried to adapt the envoy telemetry dashboard to kubernetes envoys. Please take a look if that makes sense to you https://grafana.wikimedia.org/d/b1jttnFMz/jayme-envoy-telemetry-k8s
2020-09-25 14:07:37 <akosiaris> jayme: I 've added https://grafana.wikimedia.org/d/b1jttnFMz/jayme-envoy-telemetry-k8s?viewPanel=17&orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=All&from=1600954758430&to=1600960133933
2020-09-25 14:08:06 <akosiaris> that's a pretty low error rate btw
2020-09-25 14:08:13 <jayme> ah, great
2020-09-25 14:08:14 <akosiaris> not sure if it's worth to investigate this a lot
2020-09-25 14:08:26 <jayme> yeah...+1
2020-09-25 14:09:03 <jayme> But the dashboard will be of use anyways I guess. :-)
2020-09-25 14:09:06 <akosiaris> if one zooms out to 2days it appears to be a bit often
2020-09-25 14:09:23 <akosiaris> but always fixing itself pretty quickly
2020-09-25 14:09:33 <akosiaris> yeah, the dashboard is pretty useful
2020-09-25 14:09:57 <akosiaris> I was thinking of copying some of these panels in the services dashboard themselves (some dedicated row or so)
2020-09-25 14:10:41 <jayme> Still trying to make sense of the first two latency graphs (all of that is copied from the original envoy dashboard). I think they don't make any sense as they sum over all up/downstreams
2020-09-25 14:11:05 <jayme> Do you have an idea if there was a point behind that?
2020-09-25 14:12:37 <jayme> thought about maybe excluding the admin up/downstreams in general as well...
2020-09-25 14:12:48 <akosiaris> yeah, probably not making sense as they are
2020-09-25 14:13:27 <akosiaris> ah, perhaps just giving a very quick feel of what is going on
2020-09-25 14:14:06 <akosiaris> yeah, that was probably my idea. Use the first graph for a quick rough idea and then use the other ones per endpoint/destination
2020-09-25 14:14:27 <jayme> but mixing admin in there does not make sense then as well, does it?
2020-09-25 14:14:36 <akosiaris> true, admin should be excluded
2020-09-25 14:14:48 <jayme> I mean that is health checks etc. and probably always pretty fast
2020-09-25 14:14:56 <jayme> ok. going to exclude
2020-09-25 14:24:24 <jayme> thanks for review akosiaris, I've moved the dashboard to the General folder now and removed my user scope: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s
2020-09-25 14:28:57 <_joe_> wow mobileapps alone makes 700 req/s to the mw api
2020-09-25 14:42:34 <akosiaris> on the move
2020-09-25 15:25:53 <jayme> Also added upstream connection details now: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=1600959600000&to=1600974000000&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=citoid&var-destination=All
2020-09-25 16:17:37 <wikibugs> 'serviceops, ''OTRS, ''Operations, ''Patch-For-Review, ''User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (''jcrespo) db1077 should now be available to be put back on test-* section, I don't think it is needed anymore as an m2 (otrs test) host. @M...'
2020-09-25 18:08:33 <wikibugs> 'serviceops, ''Operations, ''Traffic, ''Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (''hashar)'

This page is generated from SQL logs, you can also download static txt files from here