[00:57:37] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<22.22%)
[03:37:13] paladox: awesome, thanks! (and thank you thcipriani and mutante and anyone else involved in the gerrit-replica)
[06:51:26] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO: Create mirror of Gerrit repositories for consumption by various tools - https://phabricator.wikimedia.org/T226240 (MarcoAurelio) How updated is `gerrit-replica`? Is it immediately updated after gerrit (master)? T...
[06:57:37] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK
[07:35:14] Release-Engineering-Team, Gerrit-Privilege-Requests, Language-Team (Language-2019-July-September): Add Santhosh and Petar to wmf-deployment group - https://phabricator.wikimedia.org/T229777 (KartikMistry) p:Triage→Normal
[07:35:43] Release-Engineering-Team, Gerrit-Privilege-Requests, Language-Team (Language-2019-July-September): Add Santhosh and Petar to wmf-deployment group - https://phabricator.wikimedia.org/T229777 (KartikMistry) Confirmed that Santhosh and Petar has +2 rights on `deployment-charts` now.
[07:44:05] PROBLEM - Puppet staleness on webperformance is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0]
[10:15:47] Release-Engineering-Team, Gerrit-Privilege-Requests, Language-Team (Language-2019-July-September): Add Santhosh and Petar to wmf-deployment group - https://phabricator.wikimedia.org/T229777 (Petar.petkovic) Open→Resolved a:Petar.petkovic
[10:16:01] Release-Engineering-Team, Gerrit-Privilege-Requests, Language-Team (Language-2019-July-September): Add Santhosh and Petar to wmf-deployment group - https://phabricator.wikimedia.org/T229777 (Petar.petkovic) a:Petar.petkovic→KartikMistry
[11:32:56] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201908), Release, Train Deployments: 1.34.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T220742 (Urbanecm)
[11:42:25] !log ssh -p 29418 gerrit.wikimedia.org replication start operations/puppet --wait
[11:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[11:43:09] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Scap, Performance-Team (Radar), User-zeljkofilipin: Changing dblist files requires mtime touch of InitialiseSettings.php - https://phabricator.wikimedia.org/T217830 (Krinkle)
[13:15:10] Release-Engineering-Team, Core Platform Team Workboards (Green): Is a Jenkins slave with Stretch available? - https://phabricator.wikimedia.org/T229925 (holger.knust)
[13:17:02] Release-Engineering-Team, Core Platform Team Workboards (Green): Is a Jenkins slave with Stretch available? - https://phabricator.wikimedia.org/T229925 (holger.knust)
[13:28:07] thcipriani hi, around? :)
[13:28:19] Release-Engineering-Team, Core Platform Team Workboards (Green): Is a Jenkins slave with Stretch available? - https://phabricator.wikimedia.org/T229925 (Krinkle) Note that most (all?) Jenkins jobs don't execute on the Jenkins agents directly. Rather, they run inside a container, and [most containers](htt...
[14:03:09] (PS5) Awight: Experiment with frozen classes [integration/quibble] - https://gerrit.wikimedia.org/r/515889
[14:24:32] (CR) Krinkle: Reduce side-effects (1 comment) [integration/quibble] - https://gerrit.wikimedia.org/r/520240 (owner: Awight)
[14:27:52] (CR) Awight: Reduce side-effects (1 comment) [integration/quibble] - https://gerrit.wikimedia.org/r/520240 (owner: Awight)
[14:42:13] (PS4) Awight: Reduce side-effects [integration/quibble] - https://gerrit.wikimedia.org/r/520240
[14:43:34] (PS5) Awight: Reduce side-effects [integration/quibble] - https://gerrit.wikimedia.org/r/520240
[14:45:21] Phabricator (Upstream), Upstream: Multiple grep results in one line displayed incorrectly - https://phabricator.wikimedia.org/T197935 (Dvorapa) Cool!
[14:46:56] (CR) jerkins-bot: [V: -1] Reduce side-effects [integration/quibble] - https://gerrit.wikimedia.org/r/520240 (owner: Awight)
[14:55:57] (PS6) Awight: Reduce side-effects [integration/quibble] - https://gerrit.wikimedia.org/r/520240
[14:56:45] (CR) jerkins-bot: [V: -1] Reduce side-effects [integration/quibble] - https://gerrit.wikimedia.org/r/520240 (owner: Awight)
[15:42:44] (PS7) Awight: Reduce side-effects [integration/quibble] - https://gerrit.wikimedia.org/r/520240
[15:50:52] (CR) jerkins-bot: [V: -1] Reduce side-effects [integration/quibble] - https://gerrit.wikimedia.org/r/520240 (owner: Awight)
[17:12:29] Gerrit, Release-Engineering-Team-TODO (201908): Gerrit -> GitHub replication not up-to-date - https://phabricator.wikimedia.org/T229945 (thcipriani)
[17:21:21] Gerrit, Release-Engineering-Team-TODO (201908): Gerrit -> GitHub replication not up-to-date - https://phabricator.wikimedia.org/T229945 (thcipriani) p:Triage→Normal The logs don't seem to say much about this, other than "github" hasn't been mentioned at all today, but was mentioned quite a bit y...
[17:42:47] Gerrit, Release-Engineering-Team-TODO (201908): Gerrit -> GitHub replication not up-to-date - https://phabricator.wikimedia.org/T229945 (thcipriani) I can say that, of the above code, the line `for (Destination cfg : config.getDestinations(FilterType.ALL)) {` is definitely finding the `github` remote. A...
[17:50:37] Release-Engineering-Team, Core Platform Team Workboards (Green): Is a Jenkins slave with Stretch available? - https://phabricator.wikimedia.org/T229925 (holger.knust) Open→Resolved a:holger.knust That should work then. Thank you! Closing the ticket.
[17:59:23] Gerrit, Release-Engineering-Team-TODO (201908): Gerrit -> GitHub replication not up-to-date - https://phabricator.wikimedia.org/T229945 (thcipriani) `mirror = false` definitely broke it, looking at the logs: Last replication event happened at 20:40:52 ` thcipriani@cobalt:~$ grep -i github /var/log/ge...
[18:00:20] thcipriani ah!
[18:43:14] thcipriani: fyi I'm moving integration-slave-docker-1041, will repool when it's done
[18:43:31] crazy load on its old host so it might be a bit peppier after the move
[18:43:59] (hm, on second thought, not sure who I should be pinging about that)
[18:44:25] probably me, it looks offline in the jenkins ui so probably fine :)
[18:45:02] yeah, i depooled and waited for it to finish before shutting it down
[18:45:09] PROBLEM - Host integration-slave-docker-1041 is DOWN: CRITICAL - Host Unreachable (172.16.1.36)
[18:52:05] thanks for that :)
[19:04:51] (PS2) Umherirrender: Create new sniff for doc comments [tools/codesniffer] - https://gerrit.wikimedia.org/r/528280
[19:05:12] (CR) Umherirrender: "I have addressed that issue by a new config" [tools/codesniffer] - https://gerrit.wikimedia.org/r/528280 (owner: Umherirrender)
[19:20:43] Project beta-update-databases-eqiad build #35794: FAILURE in 43 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/35794/
[19:24:21] Project beta-code-update-eqiad build #258167: FAILURE in 1 min 21 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/258167/
[19:26:24] !log ssh -p 29418 gerrit.wikimedia.org replication start --url github mediawiki/core --wait --now
[19:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[19:34:06] paladox: I wonder if you'll be waiting a while with that --wait? we still have the replicateonstartup thing going.
[19:34:24] Yippee, build fixed!
[19:34:25] Project beta-code-update-eqiad build #258168: FIXED in 1 min 24 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/258168/
[19:34:31] I think so (the --wait is to prevent it going to a background thread)
[19:34:40] * paladox cancels it
[19:38:00] I want to try that command again once the queue is clear though
[19:38:35] ok
[19:41:29] (PS3) Umherirrender: Create new sniff for doc comments [tools/codesniffer] - https://gerrit.wikimedia.org/r/528280
[19:42:23] (CR) Umherirrender: "Fixed a problem with class comments without indent" [tools/codesniffer] - https://gerrit.wikimedia.org/r/528280 (owner: Umherirrender)
[19:56:27] thcipriani queues empty now :)
[19:56:32] (mostly)
[19:57:37] well, seems to push to github whereas it didn't before
[19:57:45] when I do replication start
[19:57:51] didn't change any configuration
[19:58:15] so seems like your theory of some kind of configuration reload bug in the replication plugin is correct
[19:58:24] :)
[20:07:39] RECOVERY - Host integration-slave-docker-1041 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms
[20:19:11] MediaWiki-Codesniffer, Patch-For-Review: Sniff to detect if there is no one extra newlines between function definitions - https://phabricator.wikimedia.org/T213861 (Umherirrender) Open→Resolved p:Triage→Normal a:Umherirrender
[20:20:05] (CR) Thcipriani: [C: +2] php7x: restart php-fpm after all sync operations [tools/scap] - https://gerrit.wikimedia.org/r/525119 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[20:21:21] MediaWiki-Codesniffer: Function comments should have same level/indent as the function - https://phabricator.wikimedia.org/T229971 (Umherirrender)
[20:21:25] Yippee, build fixed!
[20:21:26] Project beta-update-databases-eqiad build #35795: FIXED in 1 min 25 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/35795/
[20:24:05] (Merged) jenkins-bot: php7x: restart php-fpm after all sync operations [tools/scap] - https://gerrit.wikimedia.org/r/525119 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[20:24:53] (CR) jenkins-bot: php7x: restart php-fpm after all sync operations [tools/scap] - https://gerrit.wikimedia.org/r/525119 (https://phabricator.wikimedia.org/T224857) (owner: Thcipriani)
[20:28:53] Gerrit, Release-Engineering-Team-TODO (201908): Gerrit -> GitHub replication not up-to-date - https://phabricator.wikimedia.org/T229945 (thcipriani) Open→Resolved a:thcipriani >>! In T229945#5396935, @thcipriani wrote: > I can say that, of the above code, the line `for (Destination cfg : con...
[20:53:36] (PS1) Umherirrender: [TheWikipediaLibrary] Run job for seccheck [integration/config] - https://gerrit.wikimedia.org/r/528568
[20:57:26] (PS1) Umherirrender: [TheWikipediaLibrary] Add phan dependencies [integration/config] - https://gerrit.wikimedia.org/r/528570
[20:58:53] (CR) Jforrester: [C: +2] [TheWikipediaLibrary] Add phan dependencies [integration/config] - https://gerrit.wikimedia.org/r/528570 (owner: Umherirrender)
[21:00:22] (Merged) jenkins-bot: [TheWikipediaLibrary] Add phan dependencies [integration/config] - https://gerrit.wikimedia.org/r/528570 (owner: Umherirrender)
[21:00:37] (CR) Jforrester: [C: +2] [TheWikipediaLibrary] Run job for seccheck [integration/config] - https://gerrit.wikimedia.org/r/528568 (owner: Umherirrender)
[21:01:06] !log Zuul: [TheWikipediaLibrary] Add phan dependencies
[21:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:02:03] (Merged) jenkins-bot: [TheWikipediaLibrary] Run job for seccheck [integration/config] - https://gerrit.wikimedia.org/r/528568 (owner: Umherirrender)
[21:05:03] !log Zuul: [TheWikipediaLibrary] Enable phan-seccheck
[21:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:06:23] (PS1) Jforrester: layout: [TheWikipediaLibrary] Enable phan [integration/config] - https://gerrit.wikimedia.org/r/528571
[21:12:23] (PS1) Umherirrender: [TheWikipediaLibrary] Run job for phan [integration/config] - https://gerrit.wikimedia.org/r/528574
[21:19:56] (CR) Jforrester: [C: +2] [TheWikipediaLibrary] Run job for phan [integration/config] - https://gerrit.wikimedia.org/r/528574 (owner: Umherirrender)
[21:20:02] (Abandoned) Jforrester: layout: [TheWikipediaLibrary] Enable phan [integration/config] - https://gerrit.wikimedia.org/r/528571 (owner: Jforrester)
[21:21:57] (Merged) jenkins-bot: [TheWikipediaLibrary] Run job for phan [integration/config] - https://gerrit.wikimedia.org/r/528574 (owner: Umherirrender)
[21:24:05] !log Zuul: [TheWikipediaLibrary] Enable plan
[21:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:24:11] Bah. Phan.
[21:43:28] Release-Engineering-Team, Core Platform Team: Need help to create and deploy Debian-packaged app - https://phabricator.wikimedia.org/T229980 (holger.knust)
[21:54:24] How to deal with scap deployments and down hosts? I tried to deploy search/mjolnir/deploy and it tries to deploy to a server that's currently depooled for failed disks. This causes scap to only ask if i want to rollback, not allowing a continue
[21:56:00] thcipriani: my best guess is you might know? ^
[21:57:20] is it necessary to add and remove servers from the dsh groups whenever one is under maintenance?
[21:57:58] ebernhardson: Shout in -operations or file a Phab task for the server to be properly depooled.
[21:58:33] James_F: sounds like an ultimate "pet" problem :P
[21:58:37] Scap uses the same pool as the load balancers.
[21:58:49] in some cases :)
[21:58:56] depending on how the dsh group is setup in puppet
[21:59:04] Fair, yes, there are complications.
[22:00:02] ebernhardson: there are a couple of ways, the easiest might be to set failure_limit in scap.cfg to 2
[22:00:14] thcipriani: how so? I can rewrite whatever puppet necessary. I'm not sure what the ideal situation is, there are complications related to having servers not getting deploys they were supposed to
[22:00:34] i suppose that will work
[22:00:39] (CR) Jforrester: [C: +1] Create new sniff for doc comments [tools/codesniffer] - https://gerrit.wikimedia.org/r/528280 (owner: Umherirrender)
[22:01:13] is search/mjolnir/deploy behind pybal? IIUC you can use that to build the dsh files that scap uses
[22:01:28] hmm, if scap uses the same pool as the load balancers then this deploy should work, search/mjolnir/deploy deploys to the elasticsearch servers which are also behind pybal
[22:01:38] and pybal appropriately isn't sending requests to the down server afaik
[22:01:45] * thcipriani digs in puppet to find the place
[22:02:03] thcipriani: mjolnir itself isn't a web-accessible thing. So no pybal
[22:02:16] thcipriani: but it's deployed on all the elasticsearch servers, which have pybal
[22:03:25] hrm, it looks like there may be a cirrus dsh file linked to elasticsearch https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/hieradata/common/scap/dsh.yaml#89
[22:04:15] and there is a deploy1001:/etc/dsh/group/cirrus with a bunch of servers
[22:04:29] thcipriani: so that should populate /etc/dsh/group/cirrus from pybal on each puppet run or some such?
[22:04:39] well, i guess etcd or whatever pybal's source of data is
[22:04:56] yep
[22:05:00] if you want to look at pybal config: https://config-master.wikimedia.org/pybal/
[22:05:30] so if you want to deploy to those servers you could specify those as a dsh_targets in scap.cfg
[22:05:37] so maybe this wasn't appropriately depooled...
[22:06:31] ah, I see you're using this file already in scap.cfg
[22:06:43] either that or puppet didn't run on deploy1001
[22:07:05] *hasn't run since it was depooled
[22:08:58] ebernhardson: re " is it necessary to add and remove servers from the dsh groups whenever one is under maintenance?" it depends but after they come back from maintenance somebody should make sure that this host gets a scap run (push or pull) before it is repooled
[22:09:11] so either on the host itself or from deploy host
[22:09:19] thcipriani: one thing perhaps i'm missing, that dsh.yaml defines cirrus as using cluster:elasticsearch,service:elasticsearch. Pybal lists the search clusters under `search` and `search-https` (both of which seems to have elastic1046 depooled correctly)
[22:09:35] mutante: right when they come back elasticsearch instances always get reimaged
[22:09:43] generally safest
[22:10:09] ebernhardson: yep, good
[22:10:59] i guess the search one reports as /pools/eqiad/elasticsearch/elasticsearch/, so maybe we just have multiple names for everything
[22:11:24] no way puppet hasn't run since, this server has been depooled for 10+ days. Hmm...
[22:12:12] Release-Engineering-Team, CPT Initiatives (Session Management Service (CDP2)), Core Platform Team Workboards (Green): Need help to create and deploy Debian-packaged app - https://phabricator.wikimedia.org/T229980 (Pchelolo)
[22:12:19] ebernhardson: hrm, that's a good question; I don't know if it's out of date or one is an alias for the other
[22:14:36] well...looks like a rabbit hole...file a bug and tell scap to allow 2 failures..
[22:16:02] definitely a rabbit hole. I didn't work on this particular part of how scap works, so I'm a little clueless about its internals.
[22:16:15] * James_F blames Bryan. ;-)
[22:17:01] heh, this maybe came a bit after Bryan last touched scap :P
[22:37:05] paladox: just to confirm, I can also make API requests against gerrit-replica?
[22:37:23] legoktm sadly nope :( (rest api is disabled on a slave)
[22:37:36] ok, so the only thing that works is cloning then?
[22:37:42] yup
[23:06:01] (PS1) Umherirrender: [TimedMediaHandler] Add phan dependency [integration/config] - https://gerrit.wikimedia.org/r/528592 (https://phabricator.wikimedia.org/T224766)
[23:57:34] (PS1) Umherirrender: [TimedMediaHandler] Run phan job [integration/config] - https://gerrit.wikimedia.org/r/528600 (https://phabricator.wikimedia.org/T224766)
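
Editor's note on the 22:00:02 suggestion (set failure_limit in scap.cfg to 2): the sketch below is one rough way to picture why that would let the deploy continue past a single down host. It is not scap's code. Only the key names failure_limit and dsh_targets, the repo search/mjolnir/deploy, the group name cirrus, and the host elastic1046 come from the log; the [global] section layout, the git_repo key, the fallback of 1, and the "continue while failures stay below the limit" rule are assumptions made for illustration.

```python
import configparser

# Hypothetical scap.cfg contents. Only failure_limit and dsh_targets are
# named in the log; the section name and git_repo key are assumptions.
SCAP_CFG = """
[global]
git_repo: search/mjolnir/deploy
dsh_targets: cirrus
failure_limit: 2
"""


def load_settings(text):
    """Read the two settings discussed in the log from an INI-style string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["global"]
    return {
        "dsh_targets": section.get("dsh_targets"),
        # Fallback of 1 is an assumption, chosen so that a single failed
        # host stops the deploy when the key is absent.
        "failure_limit": section.getint("failure_limit", fallback=1),
    }


def should_continue(failed_hosts, failure_limit):
    """Assumed rule: keep deploying while failures stay below the limit."""
    return len(failed_hosts) < failure_limit


settings = load_settings(SCAP_CFG)
print(settings)
print(should_continue(["elastic1046"], settings["failure_limit"]))
```

Under these assumptions one unreachable host is tolerated with a limit of 2 but not with a limit of 1, which matches the behaviour described at 21:54:24. Whether scap compares with "below" or "at most" is not stated in the log, so check the scap documentation before relying on the exact threshold.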