2020-09-24 08:00:58
|
<_joe_>
|
effie, jayme is push-notifications not using the service proxy?
|
2020-09-24 08:01:10
|
<_joe_>
|
https://phabricator.wikimedia.org/T260247#6489959
|
2020-09-24 08:01:42
|
<wikibugs>
|
'serviceops, ''Push-Notification-Service, ''Patch-For-Review, ''Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (''Joe)'
|
2020-09-24 08:01:51
|
<jayme>
|
_joe_: No. The understanding was that it does not connect to internal services
|
2020-09-24 08:02:25
|
<wikibugs>
|
'serviceops, ''Push-Notification-Service, ''Patch-For-Review, ''Product-Infrastructure-Team-Backlog (Kanban): Push notification service should make deletion requests to the MW API for invalid or expired subscriptions - https://phabricator.wikimedia.org/T260247 (''Joe) Or even better: it should use the ser...'
|
2020-09-24 08:02:26
|
<_joe_>
|
well it does :P
|
2020-09-24 08:02:33
|
<jayme>
|
obviously..
|
2020-09-24 08:02:33
|
<_joe_>
|
see above
|
2020-09-24 08:02:35
|
<effie>
|
I was under the same impression, let us discuss it with michael
|
2020-09-24 08:03:45
|
<_joe_>
|
the presence of main_app.mwapi_uri is quite telling imho
|
2020-09-24 08:46:05
|
<_joe_>
|
sigh there are still connections to ORES that are not via https
|
2020-09-24 08:46:11
|
<_joe_>
|
goes spelunking
|
2020-09-24 08:50:07
|
<wikibugs>
|
'serviceops, ''User-jijiki: Improve the New Service Request documentation and template - https://phabricator.wikimedia.org/T263723 (''jijiki)'
|
2020-09-24 09:35:37
|
<_joe_>
|
hnowlan: do you have a task for SLOs for the API gateway?
|
2020-09-24 09:38:28
|
<wikibugs>
|
'serviceops, ''Operations, ''Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (''Joe)'
|
2020-09-24 09:38:39
|
<hnowlan>
|
_joe_: yep, T254916
|
2020-09-24 09:42:58
|
<wikibugs>
|
'serviceops, ''Operations, ''observability, ''Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (''Joe)'
|
2020-09-24 09:44:04
|
<wikibugs>
|
'serviceops, ''Operations, ''Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (''Joe)'
|
2020-09-24 09:51:45
|
<hnowlan>
|
Any ideas why helmfile is trying to sync changes for both a "production" and a "staging" release in a service, regardless of whether it's a prod or a staging service? changeprop-jobqueue in this case
|
2020-09-24 09:59:44
|
<jayme>
|
hnowlan: with "sync" it is trying to ensure state. So I think it just ensures that there is no staging release present (e.g. it would remove existing ones from prod environments)
|
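jayme's explanation matches how helmfile declares releases: `helmfile sync` converges every release listed in the file, so a release can be declared but marked as not installed, and sync will then remove it if it exists. A hypothetical sketch (release and chart names are assumptions, not the actual deployment-charts layout):

```yaml
# Hypothetical helmfile.yaml fragment: "sync" converges BOTH releases,
# installing one and ensuring the other is absent.
releases:
  - name: production
    chart: wmf-stable/changeprop   # chart name assumed for illustration
    installed: true
  - name: staging
    chart: wmf-stable/changeprop
    installed: false   # sync deletes this release if it is present
```

This is why a single sync can log actions for both the "production" and "staging" releases even when only one of them carries real changes.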
2020-09-24 10:04:22
|
<hnowlan>
|
jayme: ahh okay, the double !log for each sync made me think there was a misconfig. that makes sense
|
2020-09-24 10:13:27
|
<_joe_>
|
hnowlan: !log is something we still need to improve
|
2020-09-24 10:13:33
|
<_joe_>
|
contributions welcome :P
|
2020-09-24 10:49:44
|
<_joe_>
|
hnowlan: so... T263727 is the thing we need to do in order to obtain information about the rest api latencies
|
2020-09-24 11:04:10
|
<_joe_>
|
hnowlan: ugh I have a few issues with changeprop's config, we need to talk about it later
|
2020-09-24 11:04:23
|
<_joe_>
|
but basically, it should switch to use the https uri for ORES
|
2020-09-24 11:05:03
|
<_joe_>
|
and also for restbase :)
|
2020-09-24 11:05:08
|
<_joe_>
|
I'll send a patch I guess
|
2020-09-24 11:06:38
|
<hnowlan>
|
_joe_: oh, thanks for the ticket - I'll have a look
|
2020-09-24 11:06:46
|
<hnowlan>
|
interestingly I am debugging issues with changeprop-jobqueue right now
|
2020-09-24 11:07:15
|
<_joe_>
|
what issues?
|
2020-09-24 11:07:26
|
<_joe_>
|
maybe it's due to something we did
|
2020-09-24 11:07:35
|
<_joe_>
|
do you have a task?
|
2020-09-24 11:08:47
|
<_joe_>
|
uhmmm although I doubt it. It just calls the jobrunners, and that didn't change
|
2020-09-24 11:08:51
|
<hnowlan>
|
not at the moment - it's just the staging instance thankfully, but the instance fails to start workers and crashloops
|
2020-09-24 11:08:54
|
<hnowlan>
|
prod instances are fine
|
2020-09-24 11:09:05
|
<_joe_>
|
oh I see
|
2020-09-24 11:09:07
|
<_joe_>
|
ok
|
2020-09-24 11:09:19
|
<_joe_>
|
nice to see staging helps finding issues before they reach prod :)
|
2020-09-24 11:09:31
|
<_joe_>
|
anyways, we can talk later, I'm taking a break
|
2020-09-24 11:09:39
|
<hnowlan>
|
sounds good
|
2020-09-24 11:31:23
|
<hnowlan>
|
oh dear, nothing quite as complex as I suspected, just an OOM kill
|
2020-09-24 13:36:40
|
<ottomata>
|
hiya
|
2020-09-24 13:36:45
|
<ottomata>
|
how can I delete a single pod?
|
2020-09-24 13:36:47
|
<ottomata>
|
i used to do
|
2020-09-24 13:36:47
|
<ottomata>
|
https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Delete_a_specific_k8s_pod
|
2020-09-24 13:37:07
|
<ottomata>
|
but now
|
2020-09-24 13:37:07
|
<ottomata>
|
Error from server (Forbidden): pods "eventstreams-canary-6df87cd67f-8qshm" is forbidden: User "eventstreams" cannot delete resource "pods" in API group "" in the namespace "eventstreams"
|
2020-09-24 13:40:30
|
<jayme>
|
ottomata: you need to be "cluster admin" to delete pods (deploy users don't have the right to do so)
|
2020-09-24 13:41:11
|
<jayme>
|
ottomata: go for "sudo -i; kube_env admin <CLUSTER>; kubectl -n eventstreams foo bar"
|
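Putting jayme's instructions together, the flow on a deployment host looks roughly like this (a sketch only: the cluster name and pod name are placeholders, and `kube_env` is the WMF wrapper for loading kubeconfig credentials):

```shell
# Hypothetical session based on the instructions above.
sudo -i                      # deploy users lack delete rights; become root
kube_env admin eqiad         # load cluster-admin credentials (cluster name assumed)
kubectl -n eventstreams get pods                                   # find the pod
kubectl -n eventstreams delete pod eventstreams-canary-6df87cd67f-8qshm
```

The deployment will immediately replace the deleted pod, which is what makes this a safe way to bounce a single misbehaving instance.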
2020-09-24 13:41:23
|
<ottomata>
|
ah k
|
2020-09-24 13:51:14
|
<ottomata>
|
jayme: ....you know what would be really useful?
|
2020-09-24 13:51:22
|
<ottomata>
|
some kind of logstash k8s service dashboard
|
2020-09-24 13:51:29
|
<ottomata>
|
where you could select the service/app name from a drop down
|
2020-09-24 13:51:32
|
<ottomata>
|
and get all logs for your service
|
2020-09-24 13:51:35
|
<ottomata>
|
from the app
|
2020-09-24 13:51:48
|
<wikibugs>
|
'serviceops, ''Maps, ''Product-Infrastructure-Team-Backlog: [OSM] Install imposm3 in Maps master - https://phabricator.wikimedia.org/T238753 (''MSantos) >>! In T238753#5678746, @MoritzMuehlenhoff wrote: > Ah, that explains, it was removed from Debian in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=9326...'
|
2020-09-24 13:51:50
|
<ottomata>
|
...we don't have that already, do we?
|
2020-09-24 13:51:51
|
<jayme>
|
yeah! so nice of you that you want to create one :D
|
2020-09-24 13:51:54
|
<ottomata>
|
haha
|
2020-09-24 13:51:57
|
<ottomata>
|
i'll make a ticket
|
2020-09-24 13:52:02
|
<ottomata>
|
logstash is really hard
|
2020-09-24 13:54:31
|
<wikibugs>
|
'serviceops, ''Wikimedia-Logstash, ''Kubernetes: Create a logstash dashboard showing all application logs for a selected service - https://phabricator.wikimedia.org/T263755 (''Ottomata)'
|
2020-09-24 14:00:16
|
<cdanis>
|
ottomata: mood
|
2020-09-24 14:06:12
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''Cmjohnson) a:''Cmjohnson→''RobH @robh power has been pulled and flea power drained'
|
2020-09-24 14:33:47
|
<wikibugs>
|
'serviceops, ''Wikimedia-Logstash, ''Kubernetes: Create a logstash dashboard showing all application logs for a selected service - https://phabricator.wikimedia.org/T263755 (''Ottomata) I guess this is only possible in logstash-next: https://www.elastic.co/blog/interactive-inputs-on-kibana-dashboards'
|
2020-09-24 14:38:36
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009241438_robh_29643...'
|
2020-09-24 14:38:39
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] `'
|
2020-09-24 14:39:16
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009241439_robh_30270...'
|
2020-09-24 15:00:00
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] `'
|
2020-09-24 15:00:13
|
<wikibugs>
|
'serviceops, ''Wikimedia-Logstash, ''Kubernetes: Create a logstash dashboard showing all application logs for a selected service - https://phabricator.wikimedia.org/T263755 (''Ottomata) Hm, perhaps: https://logstash-next.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c?_g=(filters%3A...'
|
2020-09-24 15:22:53
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''RobH) Host is now online (reimaged) and returned to service post scap pull and repool. Set to active in netbox. However, its not in the DSH node groups, and the directions aren't clear on where th...'
|
2020-09-24 15:27:31
|
<addshore>
|
o/ hi all, What are the current hosting options for "static sites" that may have a build step in them before they are ready to be "static"?
|
2020-09-24 15:28:27
|
<addshore>
|
and also, I remember talking in the past about trying to get things like the query service UI (for wikidata) deployed via blubber etc so that we (wmde) could do our own deployments. Is that still "kind of okay"? or are there more options now?
|
2020-09-24 15:31:08
|
<_joe_>
|
addshore: we're all in a meeting sorry
|
2020-09-24 15:31:13
|
<addshore>
|
ack! np!
|
2020-09-24 15:54:03
|
<wikibugs>
|
'serviceops, ''Operations, ''Recommendation-API: recommendation-api alerting and api errors - https://phabricator.wikimedia.org/T262587 (''crusnov) p:''Triage→''Medium'
|
2020-09-24 16:00:17
|
<wikibugs>
|
'serviceops, ''Wikidata, ''Wikidata-Termbox: Termbox service: unusual errors that could be from envoy - https://phabricator.wikimedia.org/T263764 (''Tarrow)'
|
2020-09-24 16:02:53
|
<wikibugs>
|
'serviceops, ''Operations, ''observability: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (''crusnov)'
|
2020-09-24 16:11:22
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''RobH) After checkign with @Joe via irc, it seems this should automatically be added back into DSH and clear after the puppet run and repooling, but has not. All other checks green, but I'd like to...'
|
2020-09-24 16:13:24
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''Volans) It looks it's marked as inactive on conftool: ` $ confctl select 'name=mw1360.eqiad.wmnet' get {"mw1360.eqiad.wmnet": {"weight": 30, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=api_apps...'
|
2020-09-24 16:30:50
|
<wikibugs>
|
'serviceops, ''Operations, ''User-WDoran, ''User-brennen: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (''brennen)'
|
2020-09-24 16:36:06
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (''RobH) ''Open→''Resolved Ok, all is now green for the host in icinga and it shows in pooled/in service state.'
|
2020-09-24 16:46:56
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-codfw: decom wtp2005 (was: wtp2005 hardware issue) - https://phabricator.wikimedia.org/T257903 (''Papaul) ''Open→''Resolved Disks removed from server and unrack'
|
2020-09-24 18:20:17
|
<wikibugs>
|
'serviceops, ''Operations, ''Product-Infrastructure-Team-Backlog, ''Wikifeeds, and 2 others: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (''Pchelolo)'
|
2020-09-24 19:47:34
|
<wikibugs>
|
'serviceops, ''Operations, ''Parsing-Team, ''Platform Engineering, and 5 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (''daniel)'
|
2020-09-24 19:47:54
|
<wikibugs>
|
'serviceops, ''Operations, ''Parsing-Team, ''TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (''daniel)'
|
2020-09-24 19:48:40
|
<wikibugs>
|
'serviceops, ''Operations, ''Parsing-Team, ''TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (''daniel) a:''holger.knust→''None'
|
2020-09-24 19:52:29
|
<bd808>
|
Is there a plan to make a buster + python3.7 base image? The current docker-registry.wikimedia.org/python3 is 3.5 + stretch.
|
2020-09-24 19:53:37
|
<wikibugs>
|
'serviceops, ''Operations, ''Parsing-Team, ''TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (''daniel) Looking into this some more, we came across a number of issues, namely: * Diffs and permalinks don...'
|
2020-09-24 23:56:09
|
<wikibugs>
|
'serviceops, ''MW-on-K8s, ''Operations, ''TechCom-RFC, ''Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (''tstarling)'
|
2020-09-25 05:50:41
|
<_joe_>
|
bd808: no one did because no one needed it. Patches or tasks welcome :P
|
2020-09-25 06:32:08
|
<wikibugs>
|
'serviceops, ''MW-on-K8s, ''Operations, ''TechCom-RFC, ''Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (''tstarling) I uploaded the Score changes to give you an idea of what a moderately complex caller looks like in practice.
It's l...'
|
2020-09-25 06:53:16
|
<wikibugs>
|
'serviceops, ''Operations, ''ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (''elukey) Added a week of downtime, sorry for the powercycle :('
|
2020-09-25 07:44:44
|
<addshore>
|
_joe_: ping re those questions that I asked yesterday :) If this isn't the best place to ask / I should target someone specifically, let me know!
|
2020-09-25 07:45:35
|
<_joe_>
|
addshore: I'm supposed to be off today, so if my partner catches me here, we'll pretend we're talking about football
|
2020-09-25 07:45:58
|
<_joe_>
|
but: I would suppose we could just have a service for all kinds of static content
|
2020-09-25 07:46:35
|
<addshore>
|
aaah, you're thinking maybe 1 k8s service somehow for a bunch of different sites / content?
|
2020-09-25 07:46:36
|
<_joe_>
|
a helm chart for an nginx with some configuration
|
2020-09-25 07:46:54
|
<_joe_>
|
yes, it's only static stuff, there is no point in having many
|
2020-09-25 07:47:32
|
<addshore>
|
that's an interesting idea, I wonder how that would end up tying together with blubber and all
|
2020-09-25 07:47:41
|
<_joe_>
|
now, it's trivial to do I think, but if you do it correctly, we can just make the chart smart enough that we can re-use the same nginx image
|
2020-09-25 07:47:51
|
<_joe_>
|
oh I was thinking you don't need blubber at all
|
2020-09-25 07:48:07
|
<addshore>
|
that's also fine! (no blubber)
|
2020-09-25 07:48:09
|
<_joe_>
|
and you could mount the static content as volumes in k8s
|
2020-09-25 07:48:25
|
<_joe_>
|
but it can also be built into the image
|
2020-09-25 07:48:35
|
<addshore>
|
Right, and then have a similar sort of process for the static content, build it, put it in some repo somewhere?
|
2020-09-25 07:48:35
|
<_joe_>
|
let's hear what others think too, though
|
2020-09-25 07:49:15
|
<_joe_>
|
yeah so, either we just abuse blubber, create a repo called wikimedia-static-content and the build is just a nginx image + the content of the repo
|
2020-09-25 07:49:24
|
<addshore>
|
I like the idea, my main question then is what would the update process be for content, and would we (wmde) easily be able to do it?
|
2020-09-25 07:49:28
|
<_joe_>
|
I keep saying nginx but assume it's any webserver
|
2020-09-25 07:49:56
|
<_joe_>
|
so, thinking about it
|
2020-09-25 07:50:17
|
<_joe_>
|
probably the pre-built image is the best idea, so using blubber
|
2020-09-25 07:50:22
|
<_joe_>
|
to make your life easier
|
2020-09-25 07:50:55
|
<_joe_>
|
but! we will need a dedicated subdomain for this I assume?
|
2020-09-25 07:51:01
|
<addshore>
|
So, some service called "static content" which has a repo with a blubber file that builds an image with the content in it?
|
2020-09-25 07:51:15
|
<_joe_>
|
and the nginx or apache config
|
2020-09-25 07:51:18
|
<addshore>
|
ack
|
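The shared static-content service _joe_ sketches could wrap a webserver config along these lines (a minimal illustration only: the server name, port, and paths are assumptions, and whether content is baked into the image or volume-mounted was still open at this point):

```nginx
# Hypothetical nginx config for one site in a shared static-content service.
# The content root is either built into the image (via blubber) or
# mounted as a k8s volume, per the discussion above.
server {
    listen 8080;
    server_name query-ui.example.wikimedia.org;   # assumed name
    root /srv/static/query-ui;                    # assumed path
    location / {
        try_files $uri $uri/ /index.html;
    }
}
```

The chart would then mostly parameterize `server_name` and `root` per site, which is what lets one nginx image serve many static sites.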
2020-09-25 07:52:13
|
<addshore>
|
So, 1 use case for this would be the UI currently at query.wikidata.org (but query.wikidata.org/sparql etc still need to point at the wdqs hosts).
|
2020-09-25 07:52:13
|
<addshore>
|
A second usecase and site of static content would be something like query.wikidata.org/magicplace
|
2020-09-25 07:52:53
|
<addshore>
|
I guess these could get handled by the nginx service running on the wdqs servers then? (could also be done at a different level, but I don't know about that)
|
2020-09-25 07:52:55
|
<_joe_>
|
uhhh so you would need to split by url at the edge? that's a big no-no
|
2020-09-25 07:53:28
|
<addshore>
|
^^ yeah, that sounded hard, so I didn't think about that
|
2020-09-25 07:53:40
|
<_joe_>
|
tbh, for your specific problem, I would assume having a repo with the static content and just deploy it with scap3 to the wdqs nodes would make more sense
|
2020-09-25 07:53:59
|
<addshore>
|
so, feedback from the wdqs folks was that ideally we wouldn't be deploying more stuff there
|
2020-09-25 07:54:31
|
<addshore>
|
and ideally we would even remove the UI from there, and also put it somewhere that we (wmde) can have more control over getting updates deployed there
|
2020-09-25 07:54:31
|
<_joe_>
|
ok, so here's the thing: your repo will need to contain the webserver config too, not just the static content
|
2020-09-25 07:54:57
|
<_joe_>
|
I would expect we'd do internet => varnish => wdqs frontend => wdqs
|
2020-09-25 07:55:03
|
<_joe_>
|
just for sparql queries
|
2020-09-25 07:55:33
|
<addshore>
|
"wdqs frontend" being this new static frontend bit?
|
2020-09-25 07:55:35
|
<_joe_>
|
let's see what jayme and akosiaris think about this too :)
|
2020-09-25 07:55:37
|
<_joe_>
|
yes
|
2020-09-25 07:55:50
|
<addshore>
|
okay, yeah, this all sounds fairly sane
|
2020-09-25 07:56:16
|
<addshore>
|
One thing that jumps to mind then is that wdqs sparql queries depend on this static content service (not sure if that is a good or bad thing)
|
2020-09-25 07:56:18
|
<_joe_>
|
also means we can add, in the future, some form of intelligence/filtering to what we throw back to sparql
|
2020-09-25 07:56:33
|
<_joe_>
|
well just for the public endpoint
|
2020-09-25 07:56:43
|
<addshore>
|
_joe_: haha you mean something like this https://github.com/wmde/queripulator ? ;)
|
2020-09-25 07:56:59
|
<_joe_>
|
I already love the name
|
2020-09-25 07:57:05
|
<addshore>
|
xD
|
2020-09-25 07:57:14
|
<_joe_>
|
although to be deployed in production it will need to follow our naming convention
|
2020-09-25 07:57:20
|
<_joe_>
|
so queripulatoroid
|
2020-09-25 07:57:38
|
<addshore>
|
amazing, I like the general sound of all of this, I'm going to head off for 15 mins, but would enjoy to continue chatting in this direction.
|
2020-09-25 07:57:43
|
<addshore>
|
I might also try and write some of this up
|
2020-09-25 07:57:43
|
<_joe_>
|
actually, yes, I can see the now-static site
|
2020-09-25 07:57:55
|
<_joe_>
|
add queripulator in the future :)
|
2020-09-25 07:58:22
|
<_joe_>
|
addshore: also discuss this with the discovery team folks at WMF who also do some work on WDQS
|
2020-09-25 07:58:44
|
<_joe_>
|
I'm not going to be around later sorry, as I said I'm on PTO today ;)
|
2020-09-25 07:58:54
|
<addshore>
|
np! go and enjoy your friday off!
|
2020-09-25 07:59:11
|
<addshore>
|
yes, I'll write something concise and then go and talk to discovery a little more (I was chatting with them yesterday on this topic too)
|
2020-09-25 08:15:25
|
<wikibugs>
|
'serviceops, ''Operations, ''Traffic: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (''ema)'
|
2020-09-25 08:27:21
|
<wikibugs>
|
'serviceops, ''Operations, ''Traffic: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (''ema)'
|
2020-09-25 08:36:19
|
<akosiaris>
|
o/
|
2020-09-25 08:37:30
|
<jayme>
|
o/
|
2020-09-25 08:37:36
|
<addshore>
|
o/
|
2020-09-25 09:13:45
|
<jayme>
|
_joe_: regarding the discussion about static sites in k8s I had something similar with mu.tante in my first month (he has a ton of static content hosted on some VMs as well). The issues arising there were a) (obv.) getting traffic to it without needing an LVS per static site. b) Having some kind of easy pipeline for "not-so-tech-folks" to handle content updates, releases etc.
|
2020-09-25 09:16:03
|
<jayme>
|
Maybe it would be smart to come up with something that is reusable for that as well. I was also thinking at some point: Can we serve static stuff from swift? Only hosting some kind of server in k8s? That could maybe remove the burden of building images from the people managing the static content
|
2020-09-25 09:59:59
|
<addshore>
|
^^ that sounds fun
|
2020-09-25 10:00:45
|
<addshore>
|
Keeping the content out of the images sounds like a bonus so you don't have to rebuild everything for a one-line fix on one site, for example
|
2020-09-25 10:05:05
|
<jayme>
|
akosiaris: I see you're looking into the termbox thing as well. What stands out is that the lowered error rate mentioned in https://phabricator.wikimedia.org/T255410#6488465 looks like a direct result of switching the mw api calls to use envoy (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/622580)
|
2020-09-25 10:07:34
|
<jayme>
|
the logstash links do work for me btw. (I get the error messages as well, but the document shown to me looks like the right one)
|
2020-09-25 10:07:51
|
<akosiaris>
|
jayme: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=All&var-origin_instance=All&var-destination=termbox might also be useful. I am just starting to look into it, I am still not clear on which direction the errors are
|
2020-09-25 10:08:05
|
<akosiaris>
|
remember that it's mediawiki -> termbox -> mediawiki IIRC
|
2020-09-25 10:09:31
|
<akosiaris>
|
jayme: yeah, I 've seen it be at times and at times not be what was expected. Hence following the safer approach
|
2020-09-25 10:09:54
|
<jayme>
|
ack
|
2020-09-25 10:11:15
|
<jayme>
|
[2020-09-24T14:19:17.887Z] "GET /w/index.php?format=json&title=Special:EntityData&id=Q26762157&revision=1138265532 HTTP/1.1" 503 UC 0 95 0 - "-" "wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.18.1" "88f5fde8-87bc-49d8-9804-30a212bcb780" "www.wikidata.org:6500"; "10.2.1.22:443"
|
2020-09-25 10:12:21
|
<jayme>
|
doesn't this say 503 UC while connecting "upstream" which in this case is www.wikidata.org:6500 which is localhost:6500 which is api-rw.discovery.wmnet?
|
2020-09-25 10:13:01
|
<jayme>
|
and as this is logged from envoy sidecar in termbox it is the termbox -> mediawiki part of the chain?
|
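Decoding that line: the fields follow envoy's default access-log format (which, per akosiaris below, this deployment hasn't diverged from). A small Python sketch of pulling out the interesting fields; the example line is simplified from the one quoted above, and the field positions assume the stock format string:

```python
import re

# Parse an envoy default-format access log line (a sketch; assumes the
# stock format: "[START] "METHOD PATH PROTO" CODE FLAGS RX TX DURATION
# UPSTREAM-SVC-TIME "XFF" "UA" "REQ-ID" ":AUTHORITY" "UPSTREAM_HOST"").
LOG_RE = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+|-) '
    r'(?P<upstream_svc_time>\S+) '
    r'"(?P<x_forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" "(?P<upstream_host>[^"]*)"'
)

line = ('[2020-09-25T10:11:15.000Z] "GET /w/index.php?format=json HTTP/1.1" '
        '503 UC 0 95 0 - "-" "wikibase-termbox/0.1.0" '
        '"88f5fde8-87bc-49d8-9804-30a212bcb780" "www.wikidata.org:6500" '
        '"10.2.1.22:443"')

m = LOG_RE.match(line)
assert m is not None
# 503 with response flag UC means "upstream connection termination":
# envoy reached the upstream host but the connection was torn down,
# which supports jayme's reading that this is the termbox -> MW API leg.
print(m['status'], m['flags'], m['authority'], m['upstream_host'])
```

So the ":authority" the request was sent with is `www.wikidata.org:6500` (the local envoy listener) and the resolved upstream is `10.2.1.22:443`, matching jayme's chain.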
2020-09-25 10:30:28
|
<jayme>
|
akosiaris: so if I'm getting that right, they had like a ton of timeouts calling https://api-ro.discovery.wmnet/w/index.php until envoy, and now they get (way less) UC 503 from envoy instead, no longer seeing the timeouts in termbox itself, right?
|
2020-09-25 10:31:33
|
<jayme>
|
One thing is with envoy we route them to the rw-api for some reason (don't know if that makes a difference) and the other thing is the errors seen now could very well have been there all the time, hiding in the timeouts
|
2020-09-25 10:48:18
|
<akosiaris>
|
could be...
|
2020-09-25 10:49:55
|
<tarrow>
|
o/
|
2020-09-25 10:50:21
|
<tarrow>
|
any questions about the above feel free to fire them at me
|
2020-09-25 10:51:51
|
<tarrow>
|
I think we also *suspect* that the host header should not actually be set to `www.wikidata.org:6500` but instead plain `www.wikidata.org`. Although clearly this is working most of the time
|
2020-09-25 10:58:16
|
<akosiaris>
|
that's not the host header. It's the authority HTTP2 header and the port is fine there per https://tools.ietf.org/html/rfc7540#section-8.1.2.3
|
2020-09-25 11:00:16
|
<akosiaris>
|
note that per RFC 3986, authority = [ userinfo "@" ] host [ ":" port ] (you need this knowledge too to understand that part of rfc7540)
|
2020-09-25 11:02:00
|
<akosiaris>
|
the envoy access logging default format (we haven't yet diverged from it) is in https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage
|
2020-09-25 11:16:00
|
<tarrow>
|
So this `GET /w/index.php?format=json&title=Special:EntityData&id=Q26762157&revision=1138265532 HTTP/1.1" 503 UC 0 95 0 - "-" "wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.18.1" "88f5fde8-87bc-49d8-9804-30a212bcb780" "www.wikidata.org:6500"; "10.2.1.22:443"` isn't the host header? even though it's HTTP 1.1 ?
|
2020-09-25 11:16:46
|
<tarrow>
|
reads up a little on these http fundamentals
|
2020-09-25 11:19:07
|
<akosiaris>
|
tarrow: btw even in HTTP/1.1 the Host header MUST have the port if it differs from the expected one for the protocol https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.23
|
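akosiaris's point about RFC 3986's authority component (and hence both the HTTP/2 `:authority` pseudo-header and a non-default-port HTTP/1.1 `Host` header carrying the port) can be illustrated with the standard library's URL splitting, using the URL from the log line above:

```python
from urllib.parse import urlsplit

# RFC 3986: authority = [ userinfo "@" ] host [ ":" port ]
# urlsplit exposes the pieces; .netloc is the full authority.
u = urlsplit('https://www.wikidata.org:6500/w/index.php?format=json')
print(u.netloc)    # www.wikidata.org:6500  <- what envoy logs as :authority
print(u.hostname)  # www.wikidata.org
print(u.port)      # 6500

# With the scheme's default port, the authority carries no port part,
# which is why a plain "Host: www.wikidata.org" is what you usually see.
d = urlsplit('https://www.wikidata.org/w/index.php')
print(d.port)      # None
```

So `www.wikidata.org:6500` in the access log is well-formed: 6500 is not the default port for https, therefore the authority (and an HTTP/1.1 Host header) must include it.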
2020-09-25 11:20:20
|
<akosiaris>
|
on the move
|
2020-09-25 11:38:40
|
<wikibugs>
|
'serviceops, ''Operations, ''Wikidata, ''Wikidata-Termbox, ''User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (''Addshore)'
|
2020-09-25 11:46:51
|
<wikibugs>
|
'serviceops, ''Operations, ''Wikidata, ''Wikidata-Termbox, ''User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (''Addshore)'
|
2020-09-25 13:46:18
|
<jayme>
|
akosiaris: _joe_: I tried to adapt the envoy telemetry dashboard to kubernetes envoys. Please take a look if that makes sense to you https://grafana.wikimedia.org/d/b1jttnFMz/jayme-envoy-telemetry-k8s
|
2020-09-25 14:07:37
|
<akosiaris>
|
jayme: I 've added https://grafana.wikimedia.org/d/b1jttnFMz/jayme-envoy-telemetry-k8s?viewPanel=17&orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=All&from=1600954758430&to=1600960133933
|
2020-09-25 14:08:06
|
<akosiaris>
|
that's a pretty low error rate btw
|
2020-09-25 14:08:13
|
<jayme>
|
ah, great
|
2020-09-25 14:08:14
|
<akosiaris>
|
not sure if it's worth to investigate this a lot
|
2020-09-25 14:08:26
|
<jayme>
|
yeah...+1
|
2020-09-25 14:09:03
|
<jayme>
|
But the dashboard will be of use anyways I guess. :-)
|
2020-09-25 14:09:06
|
<akosiaris>
|
if one zooms out to 2 days it appears to happen a bit more often
|
2020-09-25 14:09:23
|
<akosiaris>
|
but always fixing itself pretty quickly
|
2020-09-25 14:09:33
|
<akosiaris>
|
yeah, the dashboard is pretty useful
|
2020-09-25 14:09:57
|
<akosiaris>
|
I was thinking of copying some of these panels in the services dashboard themselves (some dedicated row or so)
|
2020-09-25 14:10:41
|
<jayme>
|
Still trying to make sense of the first two latency graphs (all of that is copied from the original envoy dashboard). I think they don't make any sense as they sum over all up/downstreams
|
2020-09-25 14:11:05
|
<jayme>
|
Do you have an idea if there was a point behind that?
|
2020-09-25 14:12:37
|
<jayme>
|
thought about maybe excluding the admin up/downstreams in general as well...
|
2020-09-25 14:12:48
|
<akosiaris>
|
yeah, probably not making sense as they are
|
2020-09-25 14:13:27
|
<akosiaris>
|
ah, perhaps just giving a very quick feel of what is going on
|
2020-09-25 14:14:06
|
<akosiaris>
|
yeah, that was probably my idea. Use the first graph for a quick rough idea and then use the other ones per endpoint/destination
|
2020-09-25 14:14:27
|
<jayme>
|
but mixing admin in there does not make sense then as well, does it?
|
2020-09-25 14:14:36
|
<akosiaris>
|
true, admin should be excluded
|
2020-09-25 14:14:48
|
<jayme>
|
I mean that is health checks etc. and probably always pretty fast
|
2020-09-25 14:14:56
|
<jayme>
|
ok. going to exclude
|
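The "exclude admin" change jayme lands on would amount to a label filter in the panel queries, roughly like the sketch below. This is a hypothetical query only: the metric (`envoy_cluster_upstream_rq_time_bucket`) and label (`envoy_cluster_name`) are standard envoy Prometheus names, but the dashboard's actual queries and label set are assumptions here.

```promql
# Hypothetical p99 latency panel query, excluding envoy's admin
# cluster (health checks etc., which would otherwise skew the
# aggregate toward "always fast").
histogram_quantile(0.99,
  sum by (le) (
    rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name!="admin"}[5m])
  )
)
```

Summing over all remaining upstreams still mixes destinations, which is why the per-destination panels stay the primary view and this aggregate is only the "quick rough idea" akosiaris describes.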
2020-09-25 14:24:24
|
<jayme>
|
thanks for review akosiaris, I've moved the dashboard to the General folder now and removed my user scope: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s
|
2020-09-25 14:28:57
|
<_joe_>
|
wow mobileapps alone makes 700 req/s to the mw api
|
2020-09-25 14:42:34
|
<akosiaris>
|
on the move
|
2020-09-25 15:25:53
|
<jayme>
|
Also added upsteam connection details now: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=1600959600000&to=1600974000000&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=citoid&var-destination=All
|
2020-09-25 16:17:37
|
<wikibugs>
|
'serviceops, ''OTRS, ''Operations, ''Patch-For-Review, ''User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (''jcrespo) db1077 should now be available to be put back on test-* section, I don't think it is needed anymore as an m2 (otrs test)
host. @M...'
|
2020-09-25 18:08:33
|
<wikibugs>
|
'serviceops, ''Operations, ''Traffic, ''Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (''hashar)'
|