[01:30:50] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2215.codfw.wmnet` - mw2215.codfw.wmnet (**PASS**) - Downti...
[01:35:54] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (Dzahn)
[05:08:11] serviceops, Code-Health-Objective, Performance-Team (Radar), Platform Team Initiatives (Session Management Service (CDP2)), and 2 others: Determine multi-dc strategy for CentralAuth - https://phabricator.wikimedia.org/T267270 (tstarling) >>! In T267270#6903410, @tstarling wrote: > There are some...
[07:03:24] serviceops, Code-Health-Objective, Patch-For-Review, Performance-Team (Radar), and 3 others: Determine multi-dc strategy for CentralAuth - https://phabricator.wikimedia.org/T267270 (tstarling) Writing that patch forced me to properly review all current usages of session storage. Foreign API token...
[09:24:55] serviceops, Wikidata, Wikidata-Termbox, wdwb-tech-focus: Missing alerts for Termbox staging and test services - https://phabricator.wikimedia.org/T276550 (Jakob_WMDE) Ah, sorry about the incorrect "for months" qualifier. It looked like the termbox files hadn't been touched in a long time and I as...
[11:18:12] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (akosiaris)
[11:28:09] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[11:41:41] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[11:51:28] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[11:54:08] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[11:56:52] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (akosiaris)
[11:58:14] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[11:58:56] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[11:59:36] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[12:06:04] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (akosiaris)
[12:06:55] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[12:10:17] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (akosiaris)
[14:11:42] serviceops, Add-Link, Growth-Team, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (kostajh)
[14:12:54] serviceops, Add-Link, Growth-Team, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (kostajh)
[14:23:35] serviceops, Add-Link, Growth-Team, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (akosiaris) Ah, so a simple for i in 1..10 do curl -s https://api.wikimedia.org/service/linkrecommendation...
[14:26:31] serviceops, Add-Link, Growth-Team, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (kostajh) >>! In T277297#6908492, @akosiaris wrote: > Ah, so a simple > > for i in 1..10 > do > curl -s ht...
[14:26:54] serviceops, Add-Link, Growth-Team, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (kostajh)
[14:27:07] serviceops, Add-Link, Growth-Team, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (kostajh)
[14:29:33] serviceops, Add-Link, Growth-Team, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (akosiaris) Ahem.. ` $ ab -n 100 -c 1 https://api.wikimedia.org/service/linkrecommendation/v0/linkrecommend...
[14:45:39] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[14:47:51] serviceops, Add-Link, Growth-Team, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (akosiaris) Bypassing the api-gateway (and the services proxy) doesn't fix this in any way. Logs aren't he...
[15:17:56] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[15:21:29] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 (JMeybohm)
[15:36:59] serviceops, Add-Link, Growth-Team, Platform Team Initiatives (API Gateway): 504 timeout and 503 errors when accessing linkrecommendation service - https://phabricator.wikimedia.org/T277297 (kostajh) >>! In T277297#6908552, @akosiaris wrote: > Bypassing the api-gateway (and the services proxy) doe...
[17:37:40] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (dduvall) Success! I was able to build and push an image via docker-pusher on releases1...
[17:37:53] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review, Release-Engineering-Team-TODO: Create restricted docker-registry namespace for security patched images - https://phabricator.wikimedia.org/T273521 (dduvall) Open→Resolved
[19:37:13] can anyone tell me how tags are created in the docker registry? if I create a tag for a repository in gerrit, should that trigger the creation of a tag in the registry?
[19:49:00] for a canary server i set the weight for the "canary" service to 1 and the weights for api and app I leave at 30 if it's modern hardware. sounds right, right?
[19:50:11] pretty sure, just thinking out loud what i'm doing
[19:55:18] urandom: which image/repo is it? my understanding is that the deployment pipeline normally pushes images after each commit, not on tags
[19:55:46] mediawiki/services/kask
[19:56:13] legoktm: a tag is a commit tho, no?
[19:56:43] I mean, it results in a new rev/checksum
[19:56:49] https://gerrit.wikimedia.org/r/c/mediawiki/services/kask/+/656135 is the most recent commit, it says it was pushed using the tags 2021-01-15-153508-production, stable
[19:57:31] yeah, I've since added a tag v1.0.7, and pushed that, with the expectation that I'd be able to use it as a docker tag
[19:57:37] which has worked in the past
[19:57:58] at least...there are a number of tags there that correspond with the git tags
[19:58:15] hmm https://docker-registry.wikimedia.org/wikimedia/mediawiki-services-kask/tags/
[19:58:40] ya, see there are v1.0.0 - v1.0.6
[19:59:03] https://wikitech.wikimedia.org/wiki/Blubber/Pipeline "When you push a tag to your repository on Gerrit, Jenkins will, once again, build and execute your test variant. If that succeeds, Jenkins builds a Docker image based on your production Blubber variant, tags it with the tag you pushed and pushes that to the Wikimedia Docker Registry. "
[19:59:36] oh there
[19:59:38] it just pushed it
[19:59:45] 2021-03-12-195445-production, stable
[19:59:51] mutante: I am not sure I ever understood what the 'canary' service is for
[19:59:58] I triggered a rebuild of the last commit
[20:00:10] mutante: I think the other weights are what matters
[20:00:49] oh, right, and it didn't have the 1.0.7 tag. I'm not sure then, maybe ask in -releng?
[20:01:21] heh, I did... I'll give them a bit more time tho
[20:02:42] legoktm: where did you see that 2021-03-12-195445-production, stable were pushed tho?
[20:02:54] the most recent comment on https://gerrit.wikimedia.org/r/c/mediawiki/services/kask/+/656135
[20:03:08] auh
[20:03:49] https://dockerregistry.toolforge.org/wikimedia/mediawiki-services-kask/tags/ must be delayed
[20:04:31] effie: ACK, thank you, I am leaving it to 30 based on "new hardware"
[20:04:52] :D
[20:05:00] that tool has been superseded by https://docker-registry.wikimedia.org/wikimedia/mediawiki-services-kask/tags/ which has a "Last updated" in the top right :)
[20:05:02] and the canary=1 like on the others. i think j.oe said that weight doesn't matter
[20:05:55] effie: and for the mcrouter proxy now suggesting 2299 but was going to merge it Monday https://gerrit.wikimedia.org/r/c/operations/puppet/+/670949
[20:06:16] just decom'ing some other old ones that are not proxy
[20:06:22] legoktm: TIL :)
[20:07:29] mutante: ack, tx tx
[20:07:56] effie: fyi, what does the "canary" service change on a host: https://puppet-compiler.wmflabs.org/compiler1001/28548/mw2374.codfw.wmnet/index.html
[20:08:05] interesting part .. listening on IPv6, heh
[20:08:37] mutante: I have applied this to all canary hosts
[20:09:08] I will roll it out to all hosts next week :)
[20:09:22] effie: aha! all is good if it's just ..canaries being used as canaries. cool
[20:11:13] it fixes this issue:
[20:11:15] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=20&orgId=1&from=1615557655000&to=1615561865000&var-server=mw1281&var-datasource=thanos&var-cluster=api_appserver
[20:11:17] vs
[20:11:29] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=20&orgId=1&from=1615557655000&to=1615561865000&var-server=mw1276&var-datasource=thanos&var-cluster=api_appserver
[20:12:07] anyway, nothing that causes production problems anyway
[20:13:13] *nod* great!
yep
[20:32:27] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2217.codfw.wmnet` - mw2217.codfw.wmnet (**PASS**) - Downti...
[20:48:20] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2218.codfw.wmnet` - mw2218.codfw.wmnet (**PASS**) - Downti...
[21:15:27] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2219.codfw.wmnet` - mw2219.codfw.wmnet (**PASS**) - Downti...
[21:17:56] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (Dzahn) p:Triage→High
[21:18:14] serviceops, Patch-For-Review: decom 28 codfw appservers purchased on 2016-05-17 (mw2215 through mw2242) - https://phabricator.wikimedia.org/T277119 (Dzahn)
[21:20:49] serviceops, Parsoid (Third-party): parsoid apt repo rolled back breaks updates - https://phabricator.wikimedia.org/T264546 (Dzahn) Bump to bump the version.. I guess.
[21:34:37] serviceops, Parsoid (Tracking), Patch-For-Review: equivalent of scandium on buster? (upgrading parsoid testing servers) - https://phabricator.wikimedia.org/T268248 (Dzahn) @ssastry What do you prefer, upgrading scandium in place so it keeps the same name or a new host with a new name that is on buste...
[21:36:37] serviceops, Parsoid (Tracking), Patch-For-Review: equivalent of scandium on buster? (upgrading parsoid testing servers) - https://phabricator.wikimedia.org/T268248 (Dzahn) Actually, since scandium is hardware and not virtual, reimaging in place would be a lot easier and the other option would involve...
[21:37:15] serviceops, Parsoid (Tracking), Patch-For-Review: upgrade scandium to buster - https://phabricator.wikimedia.org/T268248 (Dzahn)
[22:03:21] serviceops, Parsoid-Tests, SRE, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn)
[22:04:02] serviceops, Parsoid-Tests, SRE, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn) @ssastry Done! https://parsoid-rt-tests.wikimedia.org/ has been reactivated. It needed the parsoid-rt-tests.w...
[22:05:44] serviceops, Parsoid-Tests, SRE, Parsoid (Tracking), Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Dzahn)
[22:06:33] serviceops, Parsoid-Tests, SRE, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn) Open→Resolved {F34156177}
[23:06:02] serviceops, SRE, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki) I run a test on mwdebug1001 where I switched on mcrouter its onhost memcached from plain to ssl: ` "onhost": { "servers": [...
[23:40:33] serviceops, Parsoid (Tracking), Patch-For-Review: upgrade scandium to buster - https://phabricator.wikimedia.org/T268248 (ssastry) Yes, reimaging in place works. I'll take a look at it later to see if there is any data there that needs saving.