[00:06:40] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap, serviceops: 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (Dzahn) the above was after: 19:50 < mutante> !log built new scap version 3.11.1-1 on boron, copied to...
[00:07:01] thcipriani: new scap is built and published, installed on mw2250 and works:) i'll leave the roll-out on all servers via debdeploy for tomorrow
[00:08:18] mutante: cool. Glad it works for the servers that were giving you problems at least.
[00:08:57] thcipriani: yes, thank you
[00:17:29] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap, serviceops: Deploy scap 3.11.1-1 - https://phabricator.wikimedia.org/T228482 (Dzahn) built, published, deployed and tested on mw2250. just needs to be rolled out across the cluster with debdeploy now.
[00:17:47] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap, serviceops: Deploy scap 3.11.1-1 - https://phabricator.wikimedia.org/T228482 (Dzahn) a:Dzahn
[00:18:30] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap, serviceops: 'scap pull' stopped working on appservers ? - https://phabricator.wikimedia.org/T228328 (Dzahn) p:High→Normal the new scap version fixes this issue on mw2250. scap pull works there ag...
[00:32:14] (PS1) Legoktm: [SecureLinkFixer] Enable phan & seccheck [integration/config] - https://gerrit.wikimedia.org/r/524388
[04:52:22] Release-Engineering-Team-TODO (201907), Operations, serviceops, Patch-For-Review, Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (fsero) I did a complete pull of all images and tags of our...
[04:52:30] Release-Engineering-Team-TODO (201907), Operations, serviceops, Patch-For-Review, Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (fsero) p:High→Normal
[05:01:38] Project-Admins, Core Platform Team, Performance-Team: Narrow scope of MediaWiki-Database workboard - https://phabricator.wikimedia.org/T228360 (Marostegui)
[06:50:05] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK
[07:22:24] Gerrit, Gerrit-Privilege-Requests, Wikimedia-IEG-grant-review: Give access to l10n-bot to wikimedia/iegreview repository. - https://phabricator.wikimedia.org/T228490 (abi_)
[07:22:37] Gerrit, Gerrit-Privilege-Requests, Wikimedia-IEG-grant-review: Give access to l10n-bot to wikimedia/iegreview repository - https://phabricator.wikimedia.org/T228490 (abi_)
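For background on how such a request is usually resolved: per-repository Gerrit permissions live in a project.config file on the repo's refs/meta/config branch. A minimal sketch of the kind of stanza involved, assuming a group named l10n-bot; the actual change for this task (Gerrit 524493, which appears later in this log) may grant different rights:

    [access "refs/heads/*"]
        # hypothetical: let the translation bot push and review its own commits
        push = group l10n-bot
        label-Code-Review = -2..+2 group l10n-bot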
[07:46:07] Phabricator (Upstream), Upstream: Create "User-Alaa" project - https://phabricator.wikimedia.org/T228491 (alaa_wmde)
[08:13:10] Phabricator (Upstream), Upstream: Create "User-Alaa" project - https://phabricator.wikimedia.org/T228491 (alaa_wmde) Open→Resolved a:alaa_wmde https://phabricator.wikimedia.org/project/board/4184/
[08:13:14] Phabricator (Upstream), Upstream: Per-user projects for personal work in progress tracking - https://phabricator.wikimedia.org/T555 (alaa_wmde)
[08:47:33] Beta-Cluster-Infrastructure, Editing-team, Release Pipeline, serviceops, and 2 others: Migrate Beta cluster services to use Kubernetes - https://phabricator.wikimedia.org/T220235 (akosiaris) Open→Resolved a:akosiaris Agreed with @Krenair, closing for now.
[08:58:21] Project-Admins: Create "User-Alaa" project - https://phabricator.wikimedia.org/T228491 (Peachey88)
[09:02:15] (CR) Hashar: [C: +2] "Passed on https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/SecureLinkFixer/+/524383/ :)" [integration/config] - https://gerrit.wikimedia.org/r/524388 (owner: Legoktm)
[09:05:36] (Merged) jenkins-bot: [SecureLinkFixer] Enable phan & seccheck [integration/config] - https://gerrit.wikimedia.org/r/524388 (owner: Legoktm)
[09:13:35] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, Operations, Patch-For-Review: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar) >>! In T184086#5345246, @fgiunchedi wrote: > Sort of orthogonal, please consider also ad...
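T184086 above is about exposing Gerrit metrics to Prometheus. For illustration, the consuming side of such an exporter might be a scrape job along these lines; the job name, target host, and port below are placeholders, not values from the task:

    scrape_configs:
      - job_name: gerrit          # placeholder job name
        metrics_path: /metrics    # conventional exporter path
        static_configs:
          - targets:
              - gerrit.example.org:9100   # placeholder host:port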
[09:34:29] (PS2) Awight: Clean-up: split out run_selenium [integration/quibble] - https://gerrit.wikimedia.org/r/524189
[09:34:31] (PS17) Awight: Run node browser tests in each repo [integration/quibble] - https://gerrit.wikimedia.org/r/510709 (https://phabricator.wikimedia.org/T199116)
[09:34:33] (PS1) Awight: Detect selenium tests by parsing composer.json [integration/quibble] - https://gerrit.wikimedia.org/r/524477
[09:34:37] ^ Making it more fun to review
[10:09:10] Continuous-Integration-Infrastructure, Operations, Packaging: PCC always has an ERROR when compiling for servers with profile::redis::slave - https://phabricator.wikimedia.org/T228266 (jbond)
[10:10:22] (CR) Hashar: [C: +2] "Good thank you :)" [integration/quibble] - https://gerrit.wikimedia.org/r/521515 (https://phabricator.wikimedia.org/T87781) (owner: Kosta Harlan)
[10:11:11] (Merged) jenkins-bot: Don't run phpunit-unit stage if the composer script doesn't exist [integration/quibble] - https://gerrit.wikimedia.org/r/521515 (https://phabricator.wikimedia.org/T87781) (owner: Kosta Harlan)
[10:11:44] (CR) jenkins-bot: Don't run phpunit-unit stage if the composer script doesn't exist [integration/quibble] - https://gerrit.wikimedia.org/r/521515 (https://phabricator.wikimedia.org/T87781) (owner: Kosta Harlan)
[10:12:43] (CR) Hashar: [C: +2] Clean-up: split out run_selenium [integration/quibble] - https://gerrit.wikimedia.org/r/524189 (owner: Awight)
[10:13:21] (Merged) jenkins-bot: Clean-up: split out run_selenium [integration/quibble] - https://gerrit.wikimedia.org/r/524189 (owner: Awight)
[10:13:52] (CR) jenkins-bot: Clean-up: split out run_selenium [integration/quibble] - https://gerrit.wikimedia.org/r/524189 (owner: Awight)
[10:21:57] (CR) Hashar: [C: +2] "I like the idea of detecting based on composer.json, that seems reliable :-]" [integration/quibble] - https://gerrit.wikimedia.org/r/524477 (owner: Awight)
[10:22:03] (CR) jerkins-bot: [V: -1] Detect selenium tests by parsing composer.json [integration/quibble] - https://gerrit.wikimedia.org/r/524477 (owner: Awight)
[10:22:36] (PS2) Hashar: Detect selenium tests by parsing composer.json [integration/quibble] - https://gerrit.wikimedia.org/r/524477 (owner: Awight)
[10:22:38] (PS1) Hashar: Use repo_has_composer_script() in PhpUnitUnit [integration/quibble] - https://gerrit.wikimedia.org/r/524490
[10:23:03] (CR) Hashar: "The helper comes from https://gerrit.wikimedia.org/r/#/c/integration/quibble/+/524477/ :)" [integration/quibble] - https://gerrit.wikimedia.org/r/524490 (owner: Hashar)
[10:23:46] (CR) Hashar: [C: +2] Detect selenium tests by parsing composer.json [integration/quibble] - https://gerrit.wikimedia.org/r/524477 (owner: Awight)
[10:24:31] (Merged) jenkins-bot: Detect selenium tests by parsing composer.json [integration/quibble] - https://gerrit.wikimedia.org/r/524477 (owner: Awight)
[10:25:14] (CR) jenkins-bot: Detect selenium tests by parsing composer.json [integration/quibble] - https://gerrit.wikimedia.org/r/524477 (owner: Awight)
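Quibble is a Python program; the idea behind the patches above is to decide whether to run a test stage by inspecting the repository's composer.json "scripts" section rather than hard-coding lists of repositories. A minimal sketch of such a helper, borrowing the repo_has_composer_script() name from the change above (the real Quibble implementation may differ):

    import json
    import os


    def repo_has_composer_script(repo_dir, script_name):
        """Return True if composer.json in repo_dir defines the given script.

        Sketch only: error handling in the real helper may differ.
        """
        composer_path = os.path.join(repo_dir, 'composer.json')
        if not os.path.exists(composer_path):
            return False
        with open(composer_path) as f:
            composer = json.load(f)
        return script_name in composer.get('scripts', {})


    # e.g. skip a stage when the script is absent (script name hypothetical):
    # if repo_has_composer_script(mw_install_path, 'phpunit:unit'): ...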
[10:45:14] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, Operations, Patch-For-Review: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (fgiunchedi) >>! In T184086#5348933, @hashar wrote: >>>! In T184086#5345246, @fgiunchedi wrote: >...
[11:00:38] (PS1) MarcoAurelio: Adjust L10-n bot permissions [wikimedia/iegreview] (refs/meta/config) - https://gerrit.wikimedia.org/r/524493 (https://phabricator.wikimedia.org/T228490)
[11:01:04] (CR) Awight: [C: +2] "Improved!" [integration/quibble] - https://gerrit.wikimedia.org/r/524490 (owner: Hashar)
[11:01:23] (PS18) Awight: Run node browser tests in each repo [integration/quibble] - https://gerrit.wikimedia.org/r/510709 (https://phabricator.wikimedia.org/T199116)
[11:01:45] (Merged) jenkins-bot: Use repo_has_composer_script() in PhpUnitUnit [integration/quibble] - https://gerrit.wikimedia.org/r/524490 (owner: Hashar)
[11:02:15] (CR) jenkins-bot: Use repo_has_composer_script() in PhpUnitUnit [integration/quibble] - https://gerrit.wikimedia.org/r/524490 (owner: Hashar)
[11:19:02] (CR) Awight: [C: +1] Context manager to time stuff [integration/quibble] - https://gerrit.wikimedia.org/r/503125 (owner: Hashar)
[11:31:04] Gerrit: Error 500 when visiting the 'dashboards' tab for some repos - https://phabricator.wikimedia.org/T228505 (MarcoAurelio)
[11:31:45] Gerrit: Error 500 when visiting the 'dashboards' tab for some repos - https://phabricator.wikimedia.org/T228505 (MarcoAurelio)
[12:57:15] (PS1) Hashar: Allow force wmf to force push dashboards [wikimedia] (refs/meta/config) - https://gerrit.wikimedia.org/r/524506 (https://phabricator.wikimedia.org/T228505)
[12:57:22] (CR) Hashar: [V: +2 C: +2] Allow force wmf to force push dashboards [wikimedia] (refs/meta/config) - https://gerrit.wikimedia.org/r/524506 (https://phabricator.wikimedia.org/T228505) (owner: Hashar)
[13:00:36] (PS1) Hashar: Revert "Allow force wmf to force push dashboards" [wikimedia] (refs/meta/config) - https://gerrit.wikimedia.org/r/524507 (https://phabricator.wikimedia.org/T228505)
[13:00:47] (CR) Hashar: [V: +2 C: +2] Revert "Allow force wmf to force push dashboards" [wikimedia] (refs/meta/config) - https://gerrit.wikimedia.org/r/524507 (https://phabricator.wikimedia.org/T228505) (owner: Hashar)
[13:02:00] (CR) Hashar: [V: +2 C: +2] Adjust L10-n bot permissions [wikimedia/iegreview] (refs/meta/config) - https://gerrit.wikimedia.org/r/524493 (https://phabricator.wikimedia.org/T228490) (owner: MarcoAurelio)
[13:02:08] (CR) Hashar: [V: +2 C: +2] "Thank you :]" [wikimedia/iegreview] (refs/meta/config) - https://gerrit.wikimedia.org/r/524493 (https://phabricator.wikimedia.org/T228490) (owner: MarcoAurelio)
[13:47:28] reminder that I'm going to start breaking deployment-prep in a few minutes
[14:11:15] PROBLEM - Host deployment-eventlog05 is DOWN: CRITICAL - Host Unreachable (172.16.4.128)
[14:11:22] PROBLEM - Host Generic Beta Cluster is DOWN: CRITICAL - Host Unreachable (en.wikipedia.beta.wmflabs.org)
[14:12:36] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (172.16.5.131)
[14:13:38] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (172.16.5.141)
[14:14:45] RECOVERY - Host Generic Beta Cluster is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[14:22:35] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[14:25:46] PROBLEM - Host deployment-chromium02 is DOWN: CRITICAL - Host Unreachable (172.16.4.14)
[14:27:51] forgot completely, off my instances now
[14:28:11] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[14:28:19] RECOVERY - Host deployment-chromium02 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[14:30:46] RECOVERY - Host deployment-eventlog05 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[14:34:14] PROBLEM - Host deployment-imagescaler01 is DOWN: CRITICAL - Host Unreachable (172.16.5.80)
[14:34:31] PROBLEM - Host deployment-imagescaler02 is DOWN: CRITICAL - Host Unreachable (172.16.5.41)
[14:34:32] Project beta-scap-eqiad build #258506: FAILURE in 4.6 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258506/
[14:35:16] PROBLEM - Host deployment-dumps-puppetmaster02 is DOWN: CRITICAL - Host Unreachable (172.16.4.101)
[14:36:03] PROBLEM - Host deployment-etcd-01 is DOWN: CRITICAL - Host Unreachable (172.16.5.46)
[14:36:34] RECOVERY - Host deployment-dumps-puppetmaster02 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[14:41:42] RECOVERY - Host deployment-imagescaler02 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[14:43:07] RECOVERY - Host deployment-etcd-01 is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms
[14:44:15] RECOVERY - Host deployment-imagescaler01 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[14:47:18] Project beta-scap-eqiad build #258507: STILL FAILING in 2 min 55 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258507/
[14:47:54] PROBLEM - Host deployment-ircd is DOWN: CRITICAL - Host Unreachable (172.16.5.68)
[14:48:00] PROBLEM - Host deployment-kafka-jumbo-1 is DOWN: CRITICAL - Host Unreachable (172.16.5.4)
[14:48:50] PROBLEM - Host deployment-jobrunner03 is DOWN: CRITICAL - Host Unreachable (172.16.4.98)
[14:51:46] PROBLEM - Host deployment-memc06 is DOWN: CRITICAL - Host Unreachable (172.16.5.12)
[14:53:22] PROBLEM - Host deployment-logstash2 is DOWN: CRITICAL - Host Unreachable (172.16.5.22)
[14:53:54] PROBLEM - Host deployment-mediawiki-07 is DOWN: CRITICAL - Host Unreachable (172.16.4.119)
[14:56:21] RECOVERY - Host deployment-ircd is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[14:56:43] Project beta-scap-eqiad build #258508: STILL FAILING in 2 min 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258508/
[15:00:02] PROBLEM - Host deployment-memc07 is DOWN: CRITICAL - Host Unreachable (172.16.5.2)
[15:06:42] Project beta-scap-eqiad build #258509: STILL FAILING in 2 min 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258509/
[15:06:43] RECOVERY - Host deployment-memc06 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[15:07:46] RECOVERY - Host deployment-memc07 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[15:16:32] Project beta-scap-eqiad build #258510: STILL FAILING in 2 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258510/
[15:23:01] RECOVERY - Host deployment-kafka-jumbo-1 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[15:28:39] Project beta-scap-eqiad build #258511: STILL FAILING in 4 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258511/
[15:28:56] PROBLEM - Host deployment-puppetdb02 is DOWN: CRITICAL - Host Unreachable (172.16.4.104)
[15:29:15] PROBLEM - Host deployment-ores01 is DOWN: CRITICAL - Host Unreachable (172.16.4.95)
[15:33:02] PROBLEM - Host deployment-mwmaint01 is DOWN: CRITICAL - Host Unreachable (172.16.4.16)
[15:33:57] RECOVERY - Host deployment-mediawiki-07 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[15:36:40] Project beta-scap-eqiad build #258512: STILL FAILING in 2 min 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258512/
[15:38:38] RECOVERY - Host deployment-puppetdb02 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[15:39:03] RECOVERY - Host deployment-mwmaint01 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[15:40:47] RECOVERY - Host deployment-jobrunner03 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[15:46:34] Yippee, build fixed!
[15:46:34] Project beta-scap-eqiad build #258513: FIXED in 2 min 7 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258513/
[16:02:48] RECOVERY - Host deployment-ores01 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[16:08:19] RECOVERY - Host deployment-logstash2 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms
[16:26:36] Project beta-scap-eqiad build #258517: FAILURE in 2 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258517/
[16:27:38] PROBLEM - Host deployment-urldownloader02 is DOWN: CRITICAL - Host Unreachable (172.16.4.11)
[16:27:54] PROBLEM - Host deployment-puppetmaster03 is DOWN: CRITICAL - Host Unreachable (172.16.4.91)
[16:30:03] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (172.16.5.82)
[16:30:04] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (172.16.5.26)
[16:30:25] PROBLEM - Host deployment-sentry01 is DOWN: CRITICAL - Host Unreachable (172.16.5.16)
[16:32:02] PROBLEM - Host deployment-snapshot01 is DOWN: CRITICAL - Host Unreachable (172.16.4.132)
[16:34:14] RECOVERY - Host deployment-puppetmaster03 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[16:36:30] Project beta-scap-eqiad build #258518: STILL FAILING in 2 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258518/
[16:37:38] RECOVERY - Host deployment-urldownloader02 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms
[16:38:26] PROBLEM - Host deployment-zookeeper02 is DOWN: CRITICAL - Host Unreachable (172.16.5.55)
[16:40:04] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[16:42:00] RECOVERY - Host deployment-snapshot01 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms
[16:44:56] RECOVERY - Host deployment-sentry01 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[16:46:28] Yippee, build fixed!
[16:46:28] Project beta-scap-eqiad build #258519: FIXED in 2 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/258519/
[16:48:27] RECOVERY - Host deployment-zookeeper02 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[16:54:37] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[17:10:20] the deployment-prep move is done now; let me know if you find anything that's newly broken
[17:13:54] andrewbogott: I get “Firefox can’t establish a connection to the server at de.wikipedia.beta.wmflabs.org” when opening https://de.wikipedia.beta.wmflabs.org/wiki/Data-Bridge
[17:14:04] curl says “Failed to connect to de.wikipedia.beta.wmflabs.org port 443: Connection refused”
[17:15:59] Lucas_WMDE: it may be that there's a service somewhere that doesn't know to restart itself. Any idea what host is returning that 443?
[17:16:29] no idea – it’s not returning anything, it’s refusing connections
[17:16:36] port 80 works but just returns a TLS redirect
[17:17:02] (the TLS redirect on port 80 comes from deployment-cache-text05, I think, if that helps…)
[17:17:18] (if I’m reading the X-Cache header correctly)
[17:19:06] yeah, seems to be instance-deployment-cache-text05.deployment-prep.wmflabs.org. (IP 185.15.56.36)
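The debugging steps described above amount to a few standard checks; roughly (hostname from the log, everything else is stock curl/dig):

    # does anything answer TLS on port 443?
    curl -v https://de.wikipedia.beta.wmflabs.org/ -o /dev/null
    # which cache host serves the port-80 redirect? (X-Cache names the varnish instance)
    curl -sI http://de.wikipedia.beta.wmflabs.org/ | grep -i '^x-cache'
    # where does the name resolve to?
    dig +short de.wikipedia.beta.wmflabs.org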
[17:19:57] PROBLEM - Host integration-slave-jessie-1002 is DOWN: CRITICAL - Host Unreachable (172.16.1.99)
[17:20:14] (PS1) Thcipriani: localdev: add trigger jobs for mediawiki/core [integration/config] - https://gerrit.wikimedia.org/r/524565 (https://phabricator.wikimedia.org/T218360)
[17:21:29] hm, puppet is broken on that VM
[17:21:51] thcipriani: "Could not find template 'varnish/zero.inc.vcl.erb' at /etc/puppet/modules/varnish/manifests/wikimedia_vcl.pp:37:28 at /etc/puppet/modules/varnish/manifests/instance.pp:40 on node deployment-cache-text05.deployment-prep.eqiad.wmflabs"
[17:21:56] probably preventing services from restarting
[17:22:35] RECOVERY - Host integration-slave-jessie-1002 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[17:22:51] well neat.
[17:23:06] thcipriani: may or may not be the cause of things being down now :/
[17:23:22] the varnish layer seems like a likely candidate
[17:23:39] I just did a manual 'service varnish restart' but that doesn't seem to help
[17:23:53] * thcipriani checks backends
[17:29:28] backends are working
[17:29:37] something something tls layer?
[17:30:39] PROBLEM - Host integration-slave-docker-1041 is DOWN: CRITICAL - Host Unreachable (172.16.1.36)
[17:32:49] hrm, I don't see anything in webproxies for wikipedia.beta.wmflabs.org
[17:38:04] andrewbogott: The Zero template was removed a couple of weeks ago from prod; did it not get updated on the Beta Cluster?
[17:39:04] James_F: seems likely. Do you have a patch or a phab link or something about the prod change?
[17:39:07] andrewbogott: da1b4c7429
[17:39:13] thanks
[17:39:55] Also 69cc75d573
[17:40:16] thcipriani: want me to look at that puppet issue or are you already on it?
[17:40:42] RECOVERY - Host integration-slave-docker-1041 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[17:42:15] andrewbogott: I could still use the assistance. I am not sure what's broken yet :\ I've made it as far as: varnish gives me a TLS redirect
[17:42:31] ok
[17:42:39] I'll see what I can do about getting puppet working; maybe it'll fix itself
[17:43:54] PROBLEM - Host integration-slave-docker-1040 is DOWN: CRITICAL - Host Unreachable (172.16.3.86)
[17:44:09] I'm working down that line of attack as well
[17:45:54] guessing some hieradata somewhere in the tree of hieradata didn't get removed with the zero role
[17:49:35] https://www.irccloud.com/pastebin/I9UGNDH8/
[17:49:41] Shall I just strike out "- zero" from that?
[17:50:01] * andrewbogott is bold
[17:50:07] yes please
[17:50:13] where did you find that?
[17:50:48] horizon, puppet prefix for deployment-cache
[17:51:00] now puppet is happy, we'll see if that brings the site back up
[17:52:38] hm, well, we have a puppet catalog but an invalid nginx config now
[17:53:11] because of acme-chief I think
[17:54:38] yeah, I saw failures for that in the puppet run
[17:56:05] RECOVERY - Host integration-slave-docker-1040 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[17:56:58] Was acme-chief ever properly running on deployment-prep or is it too new?
[17:57:12] I don't know
[17:58:11] https://phabricator.wikimedia.org/T221268#5119837
[17:58:16] I'm pinging ema in another channel in case he's around to help
[17:58:17] seems to indicate that it maybe did
[17:58:41] hm
[17:58:53] so Krenair might be of help as well if he has time
[18:01:15] yeah, found a few things from Krenair that are vaguely related: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/501346/
[18:01:54] hi andrewbogott
[18:02:05] what's the issue regarding acme-chief?
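The pastebin above isn't preserved, but given the error and the fix, the deployment-cache prefix hieradata presumably contained a role/VCL list along these lines; everything here except the removed zero entry is a guess:

    # hypothetical shape of the Horizon prefix hiera for deployment-cache
    some_varnish_key:        # real key name unknown
      - text
      - zero                 # <- struck out, since varnish/zero.inc.vcl.erb no longer exists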
[18:02:07] hello!
[18:02:10] vgutierrez: I don't know enough yet to answer that intelligently. The short answer is 'puppet errors on the deployment-prep varnishes'
[18:02:23] for starters
[18:02:25] https://www.irccloud.com/pastebin/LY1ZjGuF/
[18:02:54] and
[18:02:54] hrm
[18:03:04] interesting
[18:03:07] and /usr/local/sbin/update-ocsp seems to not exist
[18:03:09] I wonder if we should even be using acme_chief here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/hieradata/labs.yaml#90
[18:03:25] what's the script to resolve a hiera value again?
[18:03:40] (CR) Jforrester: [C: +1] "Looks good. Deploy now?" [integration/config] - https://gerrit.wikimedia.org/r/524565 (https://phabricator.wikimedia.org/T218360) (owner: Thcipriani)
[18:03:56] ah, sorry, disregard my comment about it not existing
[18:03:59] I was typing in the wrong term
[18:04:07] what's failing is the ocsp stapling for the wikibase cert
[18:04:27] can you paste the error?
[18:04:31] vgutierrez: we can arrange for you to have a login on the affected instance if you don't have one already.
[18:04:34] https://www.irccloud.com/pastebin/NPE3Etur/
[18:05:05] ^ that one?
[18:06:23] hmm is the acme-chief server for deployment-prep working as expected?
[18:06:47] vgutierrez: I don't know — neither of us has touched this until 5 minutes ago :/
[18:06:52] krenair set that up.. so he is the one familiar with it
[18:07:05] The backstory is that I rebooted some of these hosts for maintenance, and then fixed a latent puppet issue, and now here we are
[18:08:47] we have a deployment-acme-chief03 and deployment-acme-chief04 it seems
[18:08:52] * thcipriani looks at those
[18:08:55] vgutierrez: looks like you have access already, the host in question is deployment-cache-text05.deployment-prep.eqiad.wmflabs (although probably it's hitting a couple of others)
[18:11:34] looks like the acme-chief service isn't started on acme-chief03
[18:11:53] started now
[18:12:03] * thcipriani re-runs puppet
[18:12:05] yup... something went wrong and the wikibase cert issuance got interrupted
[18:12:18] that's why update-ocsp is complaining
[18:12:26] thcipriani: so maybe the service isn't set up properly and doesn't start up on boot?
[18:12:38] mayhaps
[18:12:52] what's the host name of the acme-chief server?
[18:14:00] hm, puppet is still failing
[18:14:10] sure, the problem isn't solved
[18:14:17] deployment-acme-chief03 and deployment-acme-chief04
[18:14:21] thx
[18:14:37] deployment-acme-chief04 is having some trouble starting acme-chief it seems
[18:15:18] from journalctl there: https://phabricator.wikimedia.org/P8778
[18:16:00] right.. I'm seeing that in 03 as well
[18:17:14] there's something missing in the acme-chief config on those servers
[18:18:15] right, both challenges --> dns-01 --> issuing_ca & ns_records are empty
[18:19:57] I need to get some lunch and change venue — back later!
[18:20:06] https://www.irccloud.com/pastebin/M7XQ7RQv/
[18:20:41] thcipriani: that needs to be added to the hieradata for acme-chief03 & 04
[18:20:56] * thcipriani adds in horizon
[18:21:11] you should have that section already
[18:21:24] but issuing_ca and ns_records probably aren't there
[18:22:53] the hieradata for beta is kind of... strewn in a few places. I'm fighting with horizon for it now.
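Pieced together from the conversation, the acme-chief hieradata being repaired has roughly this shape; the nesting matches what vgutierrez describes (challenges → dns-01 → issuing_ca / ns_records), but every value shown is a placeholder:

    challenges:
      dns-01:
        issuing_ca: letsencrypt       # placeholder; was empty in the broken config
        ns_records:
          - ns0.example.org.          # placeholder nameserver records
          - ns1.example.org.
        sync_dns_servers: [...]       # also needs updating, as discussed below
        validation_dns_servers: [...]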
[18:23:50] vgutierrez: what's currently active for beta: https://phabricator.wikimedia.org/P8778#52822
[18:24:01] that's right
[18:24:08] just append issuing_ca && ns_records
[18:24:17] k
[18:25:00] ok, should be updated
[18:25:17] running puppet in 03 then
[18:27:03] hmmm
[18:27:14] those ns servers are invalid?
[18:28:04] right...
[18:28:14] those are renamed to cloud-ns0.wm.o and cloud-ns1.wm.o right thcipriani?
[18:29:07] I hadn't heard about that, but those do seem to return something from dig whereas the others don't, should I try to update?
[18:29:53] updated.
[18:29:56] thx
[18:31:09] we need to change the other configured fields as well
[18:31:33] right now the dns-01 config looks like this
[18:31:41] https://www.irccloud.com/pastebin/0WSReq4J/
[18:32:00] sync_dns_servers and validation_dns_servers need to be updated as well
[18:32:15] I'm flexing my memory... 01:30 AM here /o\
[18:33:04] ugh :(
[18:33:13] I'm not familiar with your puppetization
[18:33:41] but somewhere you're feeding acme_chief::server a parameter called authdns_servers
[18:34:44] in production I think it's populated with hieradata/common.yaml --> authdns_servers
[18:35:05] * thcipriani digs
[18:36:04] but I don't see the same key on hieradata/labs.yaml
[18:36:58] found it
[18:37:21] I think that should be updated as well
[18:37:26] it's in the prefix-puppet for deployment-acme-chief
[18:37:32] updated now
[18:37:37] running puppet in 03...
[18:38:08] * thcipriani running on 04
[18:39:01] hmmm 04 shouldn't run acme-chief
[18:39:10] cause it should behave as a passive node
[18:39:32] ok.. now it looks slightly better
[18:39:39] but acme-chief is still unable to issue certs there
[18:40:58] hmmm
[18:41:16] those nodes don't have IPv6 connectivity, right?
[18:41:23] right
[18:41:49] but the cloud-ns0.wikimedia.org. DNS record has both A and AAAA records
[18:42:01] so acme-chief thinks that the challenge injection has failed
[18:43:51] hmmm thcipriani can you set in authdns_servers the IPv4 IPs for cloud-ns0 and cloud-ns1 instead of the DNS records?
[18:44:03] * thcipriani tries
[18:44:04] so 208.80.154.135 and 208.80.154.11?
[18:44:56] the ns_records too? or just authdns_servers?
[18:44:59] nope
[18:45:05] just authdns_servers please
[18:45:08] k
[18:45:21] updated.
[18:45:30] but I think we will have issues when acme-chief tries to ssh the DNS server using the IP instead of the hostname :/
[18:46:04] or whatever does the cloud provisioner
[18:46:10] why? host-certs?
[18:47:36] hmmm it worked
[18:47:51] acme-chief03 has been able to issue the certs
[18:48:43] and deployment-cache-text05 got it after I ran puppet there
[18:48:44] so
[18:48:46] root@deployment-cache-text05:/etc/acmecerts/wikibase/new# openssl x509 -dates -noout -in rsa-2048.crt
[18:48:46] notBefore=Jul 19 17:46:06 2019 GMT
[18:48:57] ocsp-stapling shouldn't cry anymore
[18:49:30] and, magically, beta is back!
[18:49:35] thank you so much vgutierrez !!
[18:49:43] yeah... I'll go back to my vacations
[18:49:44] ;P
[18:49:46] :D
[18:49:49] enjoy!
[18:49:54] thanks again
[18:49:58] so two things happened here
[18:50:15] acme-chief03 upgrades acme-chief automagically apparently
[18:50:35] so it got version 0.19, which required some config updates
[18:50:47] and your (labs) DNS servers got renamed
[18:50:58] but the acme-chief config didn't reflect that change
[18:51:29] so when the certificates expired and acme-chief tried to get new ones... things got ugly
[18:51:59] ah
[18:52:14] that makes sense
[18:53:48] * vgutierrez off
[18:54:07] toodles!
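To re-verify the fix, the check vgutierrez ran can be repeated, plus a standard client-side probe that the TLS terminator now staples an OCSP response (the certificate path comes from the log above; both commands are stock openssl):

    # on the cache host: confirm the freshly issued cert's validity dates
    openssl x509 -dates -noout -in /etc/acmecerts/wikibase/new/rsa-2048.crt
    # from any client: confirm an OCSP response is stapled
    openssl s_client -connect de.wikipedia.beta.wmflabs.org:443 -status </dev/null 2>/dev/null | grep -i 'ocsp'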
[19:04:36] (CR) Brennen Bearnes: [C: +2] localdev: add trigger jobs for mediawiki/core [integration/config] - https://gerrit.wikimedia.org/r/524565 (https://phabricator.wikimedia.org/T218360) (owner: Thcipriani)
[19:06:56] (Merged) jenkins-bot: localdev: add trigger jobs for mediawiki/core [integration/config] - https://gerrit.wikimedia.org/r/524565 (https://phabricator.wikimedia.org/T218360) (owner: Thcipriani)
[19:10:26] Beta Cluster RB is down still.
[19:10:48] Presumably I can't just do a service restart?
[19:11:47] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/524565
[19:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[19:11:49] hopefully :)
[19:14:38] https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Help doesn't even list where services are hosted.
[19:16:36] looks like there's a deployment-restbase{1,2}
[19:16:44] er 01,02
[19:17:07] Thanks.
[19:17:53] RB seems to be running on restbase01 at least.
[19:18:56] https://restbase-beta.wmflabs.org/ just gives 502.
[19:36:05] thcipriani: I am finally back… is everything fixed now?
[19:36:16] * andrewbogott reads more backscroll
[19:37:23] andrewbogott: beta wikis seem to be back up, vgutierr.ez got acme-chief working right again, James_F was looking at restbase just now though...
[19:40:57] Yeah, I got nowhere.
[19:41:09] Couldn't find out where RB is actually controlled from; it seems to be running fine.
[19:41:52] (But not routing. Either it's broken and needs restarting or something in front of it is broken/mis-pointed.)
[19:43:23] looks like it's trying to connect to something (cassandra maybe?)
[19:43:28] I'm going to see where else puppet is broken...
[19:43:41] lots of Error: connect ECONNREFUSED 172.16.5.26:9042 in the logs
[19:44:25] yeah... cassandra is dead for some reason...
[19:46:08] broken on… deployment-sentry01.deployment-prep.eqiad.wmflabs, deployment-pdfrender02.deployment-prep.eqiad.wmflabs, deployment-mediawiki-09.deployment-prep.eqiad.wmflabs, deployment-maps04.deployment-prep.eqiad.wmflabs, deployment-logstash03.deployment-prep.eqiad.wmflabs, deployment-chromium01.deployment-prep.eqiad.wmflabs, deployment-cache-upload05.deployment-prep.eqiad.wmflabs
[19:49:04] (CR) Krinkle: "I'm not sure that removing this information from the codebases with an autofix would be an improvement." [tools/codesniffer] - https://gerrit.wikimedia.org/r/498472 (https://phabricator.wikimedia.org/T218324) (owner: Legoktm)
[19:49:26] well. i started cassandra manually and that worked.
[19:49:36] I don't know why I can't start it from the service unit
[19:49:58] that made this happen at least https://restbase-beta.wmflabs.org/
[19:57:28] I'm in dependency hell on deployment-mediawiki-09 — would you expect it to have the same deb catalog as other deployment-mediawiki-xx hosts?
[19:58:31] I saw that one
[19:58:45] I would expect it to be the same as deployment-mediawiki-07, yeah
[19:58:58] ok, I'll see if I can wrestle it into compliance
[20:00:20] deployment-maps04 fixed, deployment-pdfrender02 looks like it should maybe go away https://phabricator.wikimedia.org/T226675#5350326
[20:03:48] Release-Engineering-Team-TODO, Developer Productivity, Release Pipeline, local-charts: Define a .pipeline/blubber.yaml for mediawiki/core - https://phabricator.wikimedia.org/T218360 (Jdforrester-WMF) OK, this now triggers (and fails because we haven't added `.pipeline/config.yaml` in mediawiki/co...
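For context on T218360: the deployment pipeline builds test images from a .pipeline/blubber.yaml in the repository. A minimal sketch of what such a file looks like; the base image, packages, and entrypoint below are placeholders rather than the eventual mediawiki/core configuration:

    version: v4
    base: docker-registry.wikimedia.org/wikimedia-stretch:latest   # placeholder base image
    variants:
      test:
        apt:
          packages: [php-cli, php-xml]   # placeholder package list
        entrypoint: [composer, test]     # placeholder test entrypoint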
[20:08:19] andrewbogott: hrm, how do you pass a boolean as a class parameter through the horizon interface? Keep getting "parameter 'use_nodejs10' expects a Boolean value, got String"
[20:08:27] using true
[20:08:43] hm, that's new to me
[20:08:55] true or "true" ?
[20:09:17] true without quotes (at least in horizon)
[20:09:37] And it's the web UI giving you that error?
[20:09:39] PROBLEM - Host deployment-pdfrender02 is DOWN: CRITICAL - Host Unreachable (172.16.5.43)
[20:09:51] nah, puppet error on the machine
[20:10:00] oh!
[20:10:10] well, I bet I know why that changed
[20:10:58] maybe https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/522992/?
[20:11:06] hm, no, that's only for alerting
[20:11:18] Release-Engineering-Team-TODO, Developer Productivity, Release Pipeline, local-charts: Define a .pipeline/blubber.yaml for mediawiki/core - https://phabricator.wikimedia.org/T218360 (brennen) Yep - my last comment here turns out to be wrong and I think this will work, getting a version of https:/...
[20:11:51] well... this box seems to only have nodejs6 on it, and the parameter defaults to false, so it seems like it might be fine in this instance :)
[20:12:00] is this for kartotherian? Or tilerator?
[20:12:16] that's the easy way out for now
[20:12:16] this was for deployment-chromium01
[20:12:49] I got deployment-mediawiki-09 working properly, finally
[20:13:10] I'm going to step back from deployment-prep for now but ping me if you run into other interesting things :)
[20:13:47] k, thanks for your help
[20:15:00] Oh right, yes, proton is now node10.
[20:15:29] I guess we/someone need to update that box to have node10.
[20:18:17] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:24:17] that cherry-pick check is still a thing?
[20:49:28] Maybe rebooting the server caused it to re-alert?
[21:06:40] twentyafterfour: phab2001 is currently the only node left where puppet fails. it's known because of the aphlict service but i wanna try to fix it in puppet
[21:06:53] to get it out of https://puppetboard.wikimedia.org/nodes?status=failed
[21:06:57] oh
[21:07:16] oh wait
[21:07:19] mutante: it's failing because aphlict isn't starting? can we just make it optional?
[21:07:21] this might be new
[21:07:22] No such file or directory @ dir_chdir - /srv/phab/phabricator/support
[21:07:48] aphlict failed because of failed dependencies
[21:08:07] Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /srv/phab/phabricator/support/aphlict/server (file: /etc/puppet/modules/phabricator/manifests/aphlict.pp, line: 38)
[21:08:20] that support dir isn't here
[21:08:28] because a deploy created it.. right
[21:08:48] our common issue of directories handled by puppet or by deployment i guess
[21:10:58] mutante: fixed by checking out the submodules
[21:11:02] twentyafterfour: option a) you deploy latest phab to phab2001 so that /srv/phab/phabricator/support gets created by that?
[21:11:15] oh.. suckmod.. submodules
[21:11:35] twentyafterfour: thanks!:)
[21:12:21] running puppet
[21:12:49] all green. and https://puppetboard.wikimedia.org/nodes?status=failed is clean. yay
[21:13:48] awesome
[22:02:12] thcipriani: any idea if integration-cumin downtime will cause anyone trouble?
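The aphlict fix described above (the missing /srv/phab/phabricator/support/aphlict/server directory lives in a git submodule) presumably amounted to something like this on phab2001; the exact invocation isn't shown in the log:

    cd /srv/phab/phabricator
    git submodule update --init --recursive   # restores support/aphlict/server
    puppet agent -t                           # re-run puppet; aphlict can now start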
[22:02:41] or, twentyafterfour, same question
[22:24:04] andrewbogott: integration-cumin downtime should be fine
[22:24:26] nothing tied to that, only used for admin tasks (AFAIK)
[22:25:36] thcipriani: great, I'm going to move that now then
[22:27:57] okie doke
[22:46:06] Phabricator, Release-Engineering-Team (Kanban), Operations, serviceops, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (Dzahn) now also phab2001 has been switched to php-fpm and wor...
[23:36:21] (PS1) 20after4: Support ubuntu 16.04 [releng/local-charts] - https://gerrit.wikimedia.org/r/524623
[23:39:28] (CR) 20after4: "Note: the package name changes might actually be valid for 18.04 as well but I limited the case to 16.04 specifically by checking the dist" [releng/local-charts] - https://gerrit.wikimedia.org/r/524623 (owner: 20after4)
[23:40:58] (PS2) 20after4: Support ubuntu 16.04 [releng/local-charts] - https://gerrit.wikimedia.org/r/524623
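Limiting the 16.04-specific package names "by checking the dist", as described in the review comment above, presumably means a guard along these lines (a sketch; the real logic is in Gerrit change 524623):

    # hypothetical guard for Ubuntu 16.04-specific package names
    if [ "$(lsb_release -is)" = "Ubuntu" ] && [ "$(lsb_release -rs)" = "16.04" ]; then
      PACKAGES="..."   # placeholder: the 16.04 package names
    fi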