[00:23:04] RECOVERY - Puppet run on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:33:35] Scap3, scap: File ownership differences between Scap3 and Trebuchet - https://phabricator.wikimedia.org/T116632#2647176 (thcipriani) Open>Resolved a:thcipriani This has been solved in `scap::target`.
[00:36:29] Scap3: Local config deploys should use the target's current version - https://phabricator.wikimedia.org/T145373#2647179 (thcipriani) a:thcipriani
[03:04:55] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[03:06:43] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[03:12:43] Project mediawiki-core-code-coverage build #2271: STILL FAILING in 12 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2271/
[03:44:53] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:46:44] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:43:37] PROBLEM - Puppet run on deployment-elastic08 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[05:54:10] RECOVERY - Long lived cherry-picks on puppetmaster on deployment-puppetmaster is OK: OK: Less than 100.00% above the threshold [0.0]
[06:23:39] RECOVERY - Puppet run on deployment-elastic08 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:22:09] PROBLEM - Free space - all mounts on mira02 is CRITICAL: CRITICAL: deployment-prep.mira02.diskspace._srv.byte_percentfree (<11.11%)
[11:22:33] Jenkins can't install mediawiki core
[11:22:47] https://integration.wikimedia.org/ci/job/mwext-Wikibase-repo-tests-sqlite-hhvm/11859/console
[11:29:56] Amir1: that repo is broken / not compatible with mediawiki/core @ master
[11:30:17] Amir1: look at the stacktrace?!
exception 'MediaWiki\Services\ServiceDisabledException' with message 'Service disabled: DBLoadBalancerFactory' in /mnt/jenkins-workspace/workspace/mwext-Wikibase-repo-tests-sqlite-hhvm/src/includes/Services/ServiceContainer.php:340
[11:30:34] okay, strange
[11:30:37] aude: ^
[11:30:41] thanks hashar
[11:31:27] Amir1: ask about it in #wikidata , there might be a bug filed for it already
[11:31:44] I actually brought up the issue from there :D
[11:32:10] RECOVERY - Free space - all mounts on mira02 is OK: OK: All targets OK
[11:35:48] Amir1: https://phabricator.wikimedia.org/T146019
[11:35:56] maybe that's a duplicate, not sure
[11:36:14] happens just with a fresh mediawiki install.... no wikibase
[11:37:04] it's probably trivial to fix, but i have to look into the changes aaron has been doing
[12:41:14] hey hashar !
[12:42:41] Continuous-Integration-Infrastructure, Release-Engineering-Team, MediaWiki-Unit-tests, MediaWiki-extensions-WikibaseClient, and 4 others: Job mediawiki-extensions-php55 frequently fails due to "Segmentation fault" - https://phabricator.wikimedia.org/T142158#2525383 (aude) at the moment, i am runn...
[12:46:31] aude: that looks like a horrible segfault :/
[12:46:45] (PS1) Tobias Gritschacher: Add 2ColConflict extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311408 (https://phabricator.wikimedia.org/T145411)
[12:46:49] but now i'm not getting it :(
[12:46:59] * aude doesn't know how to reliably reproduce this
[12:47:35] btw, i have mw core wmf/1.28.0-wmf.19 checked out with whatever extensions it uses
[12:48:47] aude: so there is one specific test that segfaults right? / should I try?
[12:49:14] i don't know if it's a specific test
[12:50:03] * aude tried 5 times now
[12:54:36] addshore: aude: for the random php5.5 segfault, the CI slaves do generate a core dump for them
[12:54:51] I ran one via gdb with a super long trace.
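The core-dump workflow hashar describes (CI slaves keep the dumps, gdb yields the trace) can be sketched as below; the binary and core-file paths are assumptions, not taken from the log:

```shell
# Sketch: pull a backtrace out of a segfault core dump with batch-mode gdb.
# Both paths are hypothetical -- check /proc/sys/kernel/core_pattern to see
# where cores actually land on the slave.
BIN=/usr/bin/php5
CORE=/var/crash/core.php5.12345
# Build the invocation; 'bt full' prints the full backtrace non-interactively.
cmd=(gdb --batch -ex 'bt full' "$BIN" "$CORE")
echo "${cmd[@]}"
# On a host that actually has the core, running it captures the trace:
#   "${cmd[@]}" > /tmp/segfault-trace.txt
```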
[12:54:57] that hints at the garbage collector
[12:55:08] I have pasted to whatever task is open about it
[12:55:12] but haven't looked further
[12:55:30] one thing I remember is that we had the Zend 5.3 garbage collector segfaulting
[12:55:43] so went backporting a few patches to our debian package
[12:55:57] and we also had some hack in phpunit.php to disable the garbage collector
[12:56:13] i vaguely remember that has
[12:56:14] hack
[13:02:00] i had just checked out new / different code
[13:02:23] so maybe more memory, etc. involved in running tests on fresh (uncached) code
[13:02:46] and then maybe hit the garbage collector or something
[13:14:32] checked out master and then checked out the branch again
[13:14:35] reproduced
[13:15:21] (PS1) Tobias Gritschacher: Add ElectronPdfService extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311412 (https://phabricator.wikimedia.org/T142201)
[13:17:17] Continuous-Integration-Infrastructure, Release-Engineering-Team, MediaWiki-Unit-tests, MediaWiki-extensions-WikibaseClient, and 4 others: Job mediawiki-extensions-php55 frequently fails due to "Segmentation fault" - https://phabricator.wikimedia.org/T142158#2648071 (aude) seems I am able to repro...
[13:38:40] PROBLEM - Puppet run on deployment-apertium02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[13:43:27] hashar: do you know who is doing the train this week?
[13:43:42] assuming the issues we had last week are resolved
[13:49:52] Release-Engineering-Team (Deployment-Blockers), MediaWiki-extensions-WikibaseRepository, Wikidata, Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2648220
[13:51:54] Continuous-Integration-Infrastructure, Release-Engineering-Team, MediaWiki-Unit-tests, MediaWiki-extensions-WikibaseClient, and 4 others: Job mediawiki-extensions-php55 frequently fails due to "Segmentation fault" - https://phabricator.wikimedia.org/T142158#2648253 (aude) with set env MALLOC_CHEC...
[14:01:17] aude: will talk about it tonight
[14:01:19] (CR) Addshore: [C: 1] Add ElectronPdfService extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311412 (https://phabricator.wikimedia.org/T142201) (owner: Tobias Gritschacher)
[14:01:32] (CR) Addshore: [C: 1] Add 2ColConflict extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311408 (https://phabricator.wikimedia.org/T145411) (owner: Tobias Gritschacher)
[14:02:12] aude: looks like the train this week will be Tyler and there is no train next week
[14:02:52] ok
[14:03:15] MALLOC_CHECK_=3 !!
[14:03:22] that is magic ( https://phabricator.wikimedia.org/T142158#2648253 )
[14:03:24] if possible, i'd like to deploy wmf20 to wikidata earlier in the day on wednesday
[14:03:30] :)
[14:03:43] well
[14:03:50] we might push wmf.19 this week
[14:03:56] wmf.20 I have no clue.
[14:03:57] * aude will be on an airplane in the evening and hoo is busy with studies
[14:04:04] I guess you get some patches to catch up with recent changes in mw ?
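aude's MALLOC_CHECK_ trick and the older gc workaround hashar mentions amount to invocations like the following; the phpunit path and the `--group Wikibase` filter are assumptions for illustration, not taken from the log:

```shell
# glibc heap checking: MALLOC_CHECK_=3 makes malloc abort immediately with a
# diagnostic on heap corruption, instead of letting it surface later as a
# seemingly random segfault.
MALLOC_CHECK_=3 php5 tests/phpunit/phpunit.php --group Wikibase

# The old workaround: run the suite with the Zend garbage collector disabled,
# to rule the collector in or out as the culprit.
php5 -d zend.enable_gc=0 tests/phpunit/phpunit.php --group Wikibase
```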
[14:04:06] ok, wmf19 with new wikibase
[14:05:05] guess you can list out by editing the blocker task: https://phabricator.wikimedia.org/T143328
[14:05:09] just edit the task detail summary maybe
[14:05:41] ok
[14:06:02] right now i found a bug in lua on wikibase master
[14:06:32] so i'm not sure, but we might at least want to backport some patches then to make sure wikibase is compatible with changes in core
[14:07:02] and been trying to run our phpunit tests against wmf19 core + wikibase that is deployed now
[14:10:03] aude: I am pretty sure it is a bug in the php5.5 package we have on Trusty
[14:10:10] it must be missing some fix to Zend garbage collectors
[14:11:55] could be
[14:11:59] think i have the same package
[14:25:10] aude: and the ci Trusty slaves do capture core dumps
[14:25:41] ok
[14:25:54] i am able to reliably reproduce the issue now
[14:25:56] for some reason
[14:41:02] hashar hi, i managed to get integration.wikimedia.org homepage on http://gerrit-zuul.wmflabs.org/ :)
[15:00:16] Project mediawiki-core-code-coverage build #2272: STILL FAILING in 14 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/2272/
[15:00:26] PROBLEM - Keyholder status on mira02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:10:52] Continuous-Integration-Infrastructure, Analytics-Kanban, Differential, EventBus, Wikimedia-Stream: Run Kasocki tests in Jenkins via Differential commits - https://phabricator.wikimedia.org/T145140#2648480 (Nuria) Open>Resolved
[15:11:01] Trebuchet is broken on deployment-tin for /srv/deployment/jobrunner/jobrunner
[15:11:08] and given it is just two hosts, I am not going to try to fix it
[15:11:23] !log beta: updating jobrunner service 0dc341f..a0e8216
[15:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[15:19:48] (CR) Paladox: [C: 1] Add 2ColConflict extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311408 (https://phabricator.wikimedia.org/T145411) (owner: Tobias Gritschacher)
[15:20:02] (CR) Paladox: [C: 1] Add ElectronPdfService extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311412 (https://phabricator.wikimedia.org/T142201) (owner: Tobias Gritschacher)
[15:44:06] PROBLEM - Puppet run on deployment-db1 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[15:44:32] Beta-Cluster-Infrastructure, Scap3: Fixup beta scap3 keyholder problems - https://phabricator.wikimedia.org/T144647#2648622 (thcipriani) >>! In T144647#2634619, @bd808 wrote: > Honestly new hosts are spun up so infrequently that could just be managed manually by someone. Done for right now. Simplest th...
[16:20:47] Browser-Tests-Infrastructure, Release-Engineering-Team, MediaWiki-extensions-Examples, Documentation, and 5 others: Improve documentation around running/writing (with lots of examples) browser tests - https://phabricator.wikimedia.org/T108108#1512435 (zeljkofilipin) a:zeljkofilipin>None
[16:24:04] RECOVERY - Puppet run on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:37:41] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[16:45:06] PROBLEM - Puppet run on deployment-db1 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[16:45:56] mornin' releng folks
[16:46:05] mw-install-sqlite.sh is failing
[16:46:10] https://integration.wikimedia.org/ci/job/parsoidsvc-hhvm-parsertests-jessie/675/console
[16:46:15] https://integration.wikimedia.org/ci/job/parsoidsvc-hhvm-parsertests-jessie/674/console
[16:47:07] hasharAway ^^
[16:47:14] 16:34:05 [Mon Sep 19 16:34:04 2016] [hphp] [1421:7f4c3061b100:0:000001] [] Unable to set ResourceLimit.CoreFileSize to 8589934592: Operation not permitted (1)
[16:49:55] [Mon Sep 19 16:34:06 2016] [hphp] [1421:7f4c3061b100:0:000002] [] Exception handler threw an object exception:
[16:50:08] I have no idea why mediawiki Services are disabled.
[16:50:14] Maybe to do with db?
[16:51:45] Beta-Cluster-Infrastructure, RESTBase, Services: Beta cluster RESTbase not getting new revisions(?), so "Error loading data from server: HTTP 504" in VE - https://phabricator.wikimedia.org/T146053#2649053 (Jdforrester-WMF)
[16:52:01] Beta-Cluster-Infrastructure, RESTBase, Services: Beta cluster RESTbase not getting new revisions(?), so "Error loading data from server: HTTP 504" in VE - https://phabricator.wikimedia.org/T146053#2649041 (Jdforrester-WMF) p:Triage>High
[16:52:57] (Abandoned) Jforrester: Finish removing MoodBar, including nl.wikipedia [tools/release] - https://gerrit.wikimedia.org/r/303575 (https://phabricator.wikimedia.org/T131340) (owner: Nemo bis)
[16:53:08] (Restored) Jforrester: Finish removing MoodBar, including nl.wikipedia [tools/release] - https://gerrit.wikimedia.org/r/303575 (https://phabricator.wikimedia.org/T131340) (owner: Nemo bis)
[17:04:18] Beta-Cluster-Infrastructure, Labs: Please raise quota for deployment-prep - https://phabricator.wikimedia.org/T145611#2635940 (Andrew) This increase sounds fine to me.
[17:04:25] Beta-Cluster-Infrastructure, Labs: Request increased quota for deployment-prep labs project - https://phabricator.wikimedia.org/T145636#2636577 (Andrew) Yep, increase is fine with me.
[17:07:02] Beta-Cluster-Infrastructure, RESTBase, Services, User-mobrovac: Beta cluster RESTbase not getting new revisions(?), so "Error loading data from server: HTTP 504" in VE - https://phabricator.wikimedia.org/T146053#2649125 (mobrovac) a:mobrovac ``` Error: getaddrinfo ENOTFOUND deployment-mediawi...
[17:12:42] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:13:40] RECOVERY - Puppet run on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:16:16] Beta-Cluster-Infrastructure, RESTBase, Services, User-mobrovac: Beta cluster RESTbase not getting new revisions(?), so "Error loading data from server: HTTP 504" in VE - https://phabricator.wikimedia.org/T146053#2649189 (AlexMonk-WMF) I just ran `sudo service restb...
[17:19:33] PROBLEM - Puppet run on deployment-apertium01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[17:20:05] RECOVERY - Puppet run on deployment-db1 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:20:47] Beta-Cluster-Infrastructure, RESTBase, Services, User-mobrovac: Beta cluster RESTbase not getting new revisions(?), so "Error loading data from server: HTTP 504" in VE - https://phabricator.wikimedia.org/T146053#2649211 (AlexMonk-WMF) Seems it didn't because `modules/restbase/manifests/init.pp` s...
[17:23:43] PROBLEM - Puppet run on deployment-mx is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[17:25:39] Beta-Cluster-Infrastructure, RESTBase, Services, User-mobrovac: Beta cluster RESTbase not getting new revisions(?), so "Error loading data from server: HTTP 504" in VE - https://phabricator.wikimedia.org/T146053#2649275 (mobrovac) Open>Resolved >>! In T146053#2649211, @AlexMonk-WMF wrote:...
[17:38:16] Scap3: Local config deploys should use the target's current version - https://phabricator.wikimedia.org/T145373#2649338 (thcipriani) p:Normal>High
[17:41:11] Browser-Tests-Infrastructure, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, User-zeljkofilipin: CentralNotice: Intermittent unexplained browser test failures - https://phabricator.wikimedia.org/T145718#2649345 (DStrine)
[17:42:19] Scap3: DEPLOY_HEAD should be a symbolic ref - https://phabricator.wikimedia.org/T146062#2649353 (thcipriani)
[17:42:39] Scap3: DEPLOY_HEAD should be a symbolic ref - https://phabricator.wikimedia.org/T146062#2649367 (thcipriani) p:Triage>Low
[17:45:11] Scap3: Local config deploys should use the target's current version - https://phabricator.wikimedia.org/T145373#2649380 (thcipriani) Per today's [[ https://www.mediawiki.org/wiki/Deployment_tooling/Cabal/2016-09-19 | deployment-tooling meeting ]], the easiest path forward here might be to cache `DEPLOY_HEAD`...
[17:59:04] Release-Engineering-Team (Deployment-Blockers), MediaWiki-extensions-WikibaseRepository, Wikidata, Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2641971
[17:59:23] Release-Engineering-Team (Deployment-Blockers), MediaWiki-extensions-WikibaseRepository, Wikidata, Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2649511
[18:03:45] RECOVERY - Puppet run on deployment-mx is OK: OK: Less than 1.00% above the threshold [0.0]
[18:12:51] Release-Engineering-Team (Deployment-Blockers), MediaWiki-extensions-WikibaseRepository, Wikidata, Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2649573
[18:22:37] Beta-Cluster-Infrastructure, RESTBase, Services, User-mobrovac: Beta cluster RESTbase not getting new revisions(?), so "Error loading data from server: HTTP 504" in VE - https://phabricator.wikimedia.org/T146053#2649681 (Jdforrester-WMF)
[18:25:51] Release-Engineering-Team, Monitoring, Operations, Patch-For-Review, Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2649707 (greg) This is really a follow-up item from a wikimedia incident.
[18:46:39] (CR) Hashar: [C: 2] Add 2ColConflict extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311408 (https://phabricator.wikimedia.org/T145411) (owner: Tobias Gritschacher)
[18:47:17] (Merged) jenkins-bot: Add 2ColConflict extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311408 (https://phabricator.wikimedia.org/T145411) (owner: Tobias Gritschacher)
[18:47:44] (CR) Hashar: [C: 2] Add ElectronPdfService extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311412 (https://phabricator.wikimedia.org/T142201) (owner: Tobias Gritschacher)
[18:47:48] (PS2) Hashar: Add ElectronPdfService extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311412 (https://phabricator.wikimedia.org/T142201) (owner: Tobias Gritschacher)
[18:47:52] (CR) Hashar: Add ElectronPdfService extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311412 (https://phabricator.wikimedia.org/T142201) (owner: Tobias Gritschacher)
[18:47:54] (CR) Hashar: [C: 2] Add ElectronPdfService extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311412 (https://phabricator.wikimedia.org/T142201) (owner: Tobias Gritschacher)
[18:48:54] (Merged) jenkins-bot: Add ElectronPdfService extension to integration [integration/config] - https://gerrit.wikimedia.org/r/311412 (https://phabricator.wikimedia.org/T142201) (owner: Tobias Gritschacher)
[18:50:27] Yay the new update to grrrit-wm is working, no more i18n bot merges being shown :) Plus npm 2 and node 6 :)
[18:51:04] PROBLEM - Puppet run on deployment-zotero01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:52:18] what does it mean for puppet run to be 55.56% above the critical threshold?
[18:53:58] (CR) Hashar: "If it is not broken, there is imho no need to update it is there?" [integration/docroot] - https://gerrit.wikimedia.org/r/311345 (https://phabricator.wikimedia.org/T109747) (owner: Paladox)
[18:57:54] ori: no idea
[18:58:16] ori: but deployment-prep has a lot of "wrong" puppet failures since a couple weeks ago or so
[18:58:20] havent looked into it yet
[19:24:53] Release-Engineering-Team (Deployment-Blockers), MediaWiki-extensions-WikibaseRepository, Wikidata, Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2649943
[19:26:07] RECOVERY - Puppet run on deployment-zotero01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:32:59] hasharAway: should I file a bug for the above?
[19:33:36] !log creating T144951 integration-puppetmaster01 instance using m1.small and debian jessie
[19:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[19:33:52] arlolra: I think the install sqlite issue has been fixed...
[19:34:02] thanks for the merges hasharAway !
[19:34:04] ok, thanks
[19:34:12] https://phabricator.wikimedia.org/T146019
[19:36:39] chasemp: if you are around, could I get your +1 on the Nodepool patch that gets rid of listing floating IP ?
https://phabricator.wikimedia.org/T145142
[19:36:53] chasemp: could probably get it pushed to apt / upgraded with eu ops in the morning
[19:39:52] Beta-Cluster-Infrastructure, Operations, HHVM, Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2649992 (chasemp)
[19:39:56] Beta-Cluster-Infrastructure, Labs: Please raise quota for deployment-prep - https://phabricator.wikimedia.org/T145611#2649990 (chasemp) Open>Resolved a:chasemp
[19:45:12] Beta-Cluster-Infrastructure, Labs: Request increased quota for deployment-prep labs project - https://phabricator.wikimedia.org/T145636#2650016 (chasemp) Open>Resolved a:chasemp should be gtg, there are a few stacked quota bumps for deployment-prep so let me know @fgiunchedi if you get hung u...
[19:45:26] Release-Engineering-Team, Operations, HHVM, Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2650020 (thcipriani) >>! In T144578#2639586, @hashar wrote: > @mmodell @thcipriani @demon @dduvall can you check mira02 on beta is all fine ?...
[19:45:28] Beta-Cluster-Infrastructure, Labs: Please raise quota for deployment-prep - https://phabricator.wikimedia.org/T145611#2650022 (hashar) New quotas: | Cores | 171/192 | RAM | 350208/392400
[19:49:14] hasharAway: I commented, probably not the best nodepool specific reviewing but from my angle it's not a concern
[19:49:22] !log creating T144951 enabled role::puppetmaster::standalone role on integration-puppetmaster01
[19:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[19:49:32] s/creating //
[19:51:24] chasemp: it will be fine :] the code is not called anywhere else but where I have shortcircuited it :)
[20:03:17] !log disable puppet across integration project, moving puppetmasters
[20:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:03:47] (CR) Paladox: "Well, no it isen't broken but could do with an update. I am not sure if it fixes the bug linked, I think we may have to manually add the c" [integration/docroot] - https://gerrit.wikimedia.org/r/311345 (https://phabricator.wikimedia.org/T109747) (owner: Paladox)
[20:04:13] legoktm hmm, for some reason I can't autocomplete your name here
[20:04:21] o.O
[20:04:36] my client was having issues after the netsplits and I had to part/rejoin every channel
[20:04:44] heh
[20:04:49] i can now
[20:04:51] PROBLEM - Puppet run on deployment-mediawiki04 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[20:05:17] legoktm: puppet i srunning now
[20:05:18] *is
[20:05:53] PROBLEM - Puppet run on deployment-redis01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:06:08] legoktm: I guess ^ are all unrelated?
[20:07:09] deployment is unrelated...
[20:07:26] yeah ok
[20:07:57] PROBLEM - Puppet run on deployment-memc05 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:08:45] PROBLEM - Puppet run on deployment-ms-fe01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:08:46] !log reset puppetmaster of integration-puppetmaster01 to be labs puppetmaster
[20:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:09:12] yuvipanda: legoktm: for CI make sure to get the cherry picks on puppet.git
[20:09:38] specially that one: * 7688f83 (DO NOT SUBMIT) contint: pin firefox to 46 on Trusty
[20:09:42] hasharAway: cherry picks are no longer supported, you will lose them all when we do this migration
[20:10:08] (just kidding)
[20:10:16] ;D
[20:10:34] legoktm: ok, the puppetmaster is up
[20:10:37] hasharAway: yep, that's on our list :)
[20:10:39] be careful with the oldies like me, we might well suffer from a heart attack!
[20:10:41] legoktm: do you have a list of cherry picks already?
[20:10:55] the firefox pinning to v46 might well get resolved now. But havent looked at it yet
[20:11:06] no, give me a minute
[20:11:13] hasharAway i believe that was fixed in a recent firefox update
[20:11:18] all patches should be retrievable from Gerrit based on the Change-Id:
[20:11:22] legoktm: ok
[20:11:27] hasharAway: go sleep!
[20:11:31] i read somewhere about it fixing a driver to do with something i forgot now
[20:11:36] oh nice, only 3 cherry-picks
[20:11:37] f435b59 contint: role for Android testing
[20:11:37] 7688f83 (DO NOT SUBMIT) contint: pin firefox to 46 on Trusty
[20:11:37] 7ca33f5 ci: Role for running Raita
[20:12:03] legoktm: nice
[20:12:13] legoktm: I guess we need to pull those from gerrit somehow
[20:12:39] I already created patch files, putting them on the new puppetmaster in a minute
[20:13:04] legoktm: awesome
[20:13:50] yuvipanda: ok, they're in my home dir. Should I apply them to the git checkout now?
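Pulling a cherry-pick back from Gerrit works because every patchset is published under a refs/changes/ ref derived from the change number; a sketch, with placeholder change and patchset numbers:

```shell
# Gerrit exposes every patchset at refs/changes/<NN>/<change>/<patchset>,
# where <NN> is the change number's last two digits, zero-padded.
change_ref() {
  printf 'refs/changes/%02d/%d/%d\n' "$(( $1 % 100 ))" "$1" "$2"
}

ref=$(change_ref 311408 1)   # placeholder change/patchset numbers
echo "$ref"
# In the puppet.git checkout on the puppetmaster (not run here):
#   git fetch https://gerrit.wikimedia.org/r/operations/puppet "$ref"
#   git cherry-pick FETCH_HEAD
```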
[20:14:03] legoktm: yup
[20:14:06] same place as before
[20:14:36] PROBLEM - Puppet run on deployment-redis02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:14:50] yuvipanda: done
[20:15:28] legoktm: ok
[20:15:33] legoktm: which instance do you wanna switch?
[20:15:54] PROBLEM - Puppet run on deployment-logstash2 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[20:16:56] uh, integration-slave-trusty-1001
[20:17:00] legoktm: ok
[20:17:05] (PS1) Paladox: [mediawiki/extensions] Add noop jenkins test [integration/config] - https://gerrit.wikimedia.org/r/311497
[20:18:29] PROBLEM - Puppet run on deployment-mediawiki05 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[20:18:49] PROBLEM - Puppet run on deployment-kafka04 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:19:17] legoktm: https://wikitech.wikimedia.org/wiki/Hiera:Integration/host/integration-slave-trusty-1001
[20:19:39] and it'll automatically transition?
[20:19:45] :D
[20:20:07] legoktm: going to find out ;)
[20:20:11] I suppose we need to re-enable puppet there?
[20:20:12] ok
[20:20:27] !log re-enabled puppet on integration-slave-trusty-1001
[20:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:21:15] (PS2) Paladox: [mediawiki/extensions] Add noop jenkins test [integration/config] - https://gerrit.wikimedia.org/r/311497
[20:24:18] yuvipanda: do we just wait now...?
[20:24:37] legoktm: I forced a puppet run, am waiting
[20:24:44] :D
[20:25:02] !log delete /etc/puppet/puppet.conf.d/10-self.conf and /var/lib/puppet/ssl on integration-slave-trusty-1001
[20:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:25:43] legoktm: seems ok. test?
[20:26:21] like...run a jenkins job on it?
[20:26:26] legoktm: idk
[20:26:30] hmm
[20:26:42] something to see it isn't totally utterly broken?
[20:27:35] * legoktm hacks something up
[20:27:59] if it works, here's my plan
[20:28:03] oh, jenkins already started running a job
[20:28:12] Continuous-Integration-Infrastructure, Zuul: Fix Zuul package "postinst called with unknown argument `triggered' - https://phabricator.wikimedia.org/T146084#2650129 (hashar)
[20:28:16] copy /etc/puppet/puppet.conf from that node to all other nodes
[20:28:20] rm /etc/puppet/puppet.conf.d/10-self.conf from them all
[20:28:25] and same for /var/lib/puppet/ssl
[20:28:30] PROBLEM - Puppet run on integration-slave-trusty-1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0]
[20:28:30] and that should make them all automatically work
[20:28:40] uh ^
[20:28:53] legoktm: was the transient first failure before I rm'd /var/lib/puppet/ssl
[20:28:54] works now
[20:29:00] these lag by several minutes
[20:29:36] yuvipanda: ok, that slave looks fine
[20:29:42] ok!
[20:30:06] legoktm: ok, so I'm going to do part 1 now (copy puppet.conf to all instances)
[20:31:52] ok
[20:32:07] we have a working saltmaster thing btw, not sure how you were planning to do it
[20:32:42] legoktm: oh, I've been using clush
[20:32:50] a lot nicer than salt
[20:33:49] legoktm: ok, if we enable puppet again now, it should all 'just work'
[20:33:53] let's try
[20:33:55] ok :D
[20:34:08] !log copied /etc/puppet/puppet.conf from integration-trusty-slave-1001 to all integration
[20:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:34:14] Continuous-Integration-Infrastructure, Zuul: Fix Zuul package "postinst called with unknown argument `triggered' - https://phabricator.wikimedia.org/T146084#2650160 (hashar)
[20:34:17] !log rm -rf /var/lib/puppet/ssl on all integration nodes
[20:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:34:49] I am surprised scap does not have that already
[20:34:51] RECOVERY - Puppet run on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:34:54] scap puppet recycle
[20:35:36] yuvipanda: legoktm thank you very much for taking care of the switch to Jessie
[20:36:09] legoktm: hang on, something is slightly fucked up
[20:36:16] don't enable puppet anywhere
[20:36:21] uh ok
[20:38:33] RECOVERY - Puppet run on integration-slave-trusty-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:40:51] RECOVERY - Puppet run on deployment-redis01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:41:13] PROBLEM - Puppet run on integration-puppetmaster01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[20:41:14] legoktm: sorted out
[20:41:57] !log accidentally deleted /var/lib/puppet/ssl on integration-puppetmaster01 as well, causing it to lose keys. Reprovision by pointing to labs puppetmaster
[20:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:43:11] !log enable puppet and run on integration-slave-trusty-1003.eqiad.wmflabs
[20:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:43:42] RECOVERY - Puppet run on deployment-ms-fe01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:44:32] heh oops
[20:44:32] legoktm: works fine now
[20:45:04] legoktm: wanna re-enable one by one?
[20:45:11] or shall I just mass re-enable? :D
[20:46:27] !log re-enable puppet everywhere
[20:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[20:47:31] legoktm: ok, I think it'll just run puppet on schedule now and we can watch for failures
[20:47:45] legoktm: also right now the puppetmaster itself is on labs puppet. I'm guessing we should change that
[20:47:56] RECOVERY - Puppet run on deployment-memc05 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:48:08] sorry, was getting more food
[20:48:33] mass-enable sounds good :D
[20:48:53] hmm, what was the old one set up to do?
[20:49:47] Release-Engineering-Team, Editing-Department, Monitoring, Operations, Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2650309 (hashar)
[20:50:29] Release-Engineering-Team, Monitoring, Operations, Patch-For-Review, Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481685 (hashar) Account creation got broken entirely for 18 hours last week despite metrics being available. I have...
[20:51:14] RECOVERY - Puppet run on integration-puppetmaster01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:51:57] legoktm: the old one is its own puppetmaster
[20:53:42] I guess that makes sense?
[20:53:44] legoktm: I'm going to go afk to eat now
[20:53:49] RECOVERY - Puppet run on deployment-kafka04 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:53:52] legoktm: yeah, I agree. let's try that too
[20:53:58] I'll check that after lunch
[20:54:10] legoktm: call me if anything goes wrong within the next 30 mins, won't be checking IRC
[20:54:27] ok :)
[20:54:28] legoktm: but you are now completely free of the terrible role::puppet::self :)
[20:54:34] woo!
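The copy / rm plan above, driven through clush as described, would look roughly like this; the `@integration` node group is an assumption about the local clush configuration, and the commands are a sketch rather than the exact ones that were run:

```shell
# Push the known-good puppet.conf from the already-migrated slave to every
# node, drop the role::puppet::self leftovers, then let agents re-request
# certificates from the new standalone puppetmaster.
NODES='@integration'   # hypothetical clush group covering the project
clush -w "$NODES" --copy /etc/puppet/puppet.conf --dest /etc/puppet/puppet.conf
clush -w "$NODES" 'rm -f /etc/puppet/puppet.conf.d/10-self.conf'
clush -w "$NODES" 'rm -rf /var/lib/puppet/ssl'
clush -w "$NODES" 'puppet agent --enable && puppet agent --test'
```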
[20:54:37] RECOVERY - Puppet run on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:54:37] first project to be free of it even
[20:54:47] so your puppetmaster code is the same as prod/labs puppetmasters
[20:54:48] same for client
[20:54:54] not a bastardized copy pasta version
[20:55:53] RECOVERY - Puppet run on deployment-logstash2 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:57:46] yay :D
[20:58:28] RECOVERY - Puppet run on deployment-mediawiki05 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:03:30] Jenkins is still not happy .. https://phabricator.wikimedia.org/T146019#2650353 /cc arlolra
[21:09:43] PROBLEM - Puppet run on integration-slave-precise-1011 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:19:09] 10Beta-Cluster-Infrastructure, 03Scap3: Fixup beta scap3 keyholder problems - https://phabricator.wikimedia.org/T144647#2650420 (10hashar) That is a neat trick! And indeed given a complete list of hostnames it is quite trivial to grab the keys. I am hereby blaming everyone above to eventually have forced me...
[21:22:15] PROBLEM - Puppet run on integration-puppetmaster01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:25:46] ^ is me
[21:29:15] PROBLEM - Puppet run on integration-puppetmaster is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[21:29:46] !log regenerated client certs only on integration-puppetmaster01, seems ok now
[21:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[21:29:53] legoktm: ok, I think we can call this done now
[21:32:42] PROBLEM - Puppet run on integration-slave-jessie-1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:33:20] hmmm
[21:33:28] yuvipanda: shinken lags behind
[21:33:33] pretty sure about that
[21:33:47] so might want to come back in like 10 - 15 minutes and let it settle
[21:33:56] (or hook to instance to confirm)
[21:35:04] I just did on 3 instances
[21:35:06] all good
[21:35:50] yuvipanda: great! :]
[21:35:54] thank you!
[21:37:08] yw! now to write documentation
[21:37:43] 06Release-Engineering-Team (Deployment-Blockers), 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 07Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2650487
[21:38:19] krenair we should schedule some time to move deployment-prep over as well
[21:38:59] yuvipanda, move puppetmasters?
[21:41:16] yeah
[21:42:14] RECOVERY - Puppet run on integration-puppetmaster01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:47:45] RECOVERY - Puppet run on integration-slave-jessie-1004 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:50:45] yuvipanda, do we need to schedule stuff like that?
[21:51:05] Krenair: we don't, just need someone familiar with beta to be around for an hour or so when we do it
[21:51:19] and also to let people know, in case they are in the middle of cherry picking stuff while this is going on
[21:54:39] yuvipanda, well, I'm familiar with beta
[21:55:15] and no one is logged into the puppetmaster (except me, I just logged in to see if anyone else was)
[21:56:07] Krenair: hmm, do you have headroom to create another puppetmaster?
[21:56:18] can be m1.small right?
[21:57:18] oh, existing one is medium
[21:57:47] Krenair: should probably be medium then yeah
[21:58:13] I don't know where we are in terms of quotas because we just got a couple of bumps but haven't made the instances we requested those for yet
[21:59:06] Yippee, build fixed!
[21:59:07] Project selenium-PageTriage » firefox,beta,Linux,contintLabsSlave && UbuntuTrusty build #150: 09FIXED in 1 min 5 sec: https://integration.wikimedia.org/ci/job/selenium-PageTriage/BROWSER=firefox,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/150/
[21:59:28] PROBLEM - Puppet run on deployment-sca02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:59:55] yuvipanda: I'm gonna shutdown the old puppetmaster, and then make a reminder for myself to delete it in a week...does that sound good?
[22:00:12] and I'll email qa@
[22:00:23] legoktm: yup
[22:01:11] !log shutdown integration-puppetmaster
[22:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[22:02:59] legoktm: the only material difference is that 'restarting the puppetmaster' is now 'service apache2 restart' rather than 'service puppetmaster restart'
[22:03:58] ok
[22:04:03] I'll mention that in my email
[22:05:15] yuvipanda, yeah we'll need another quota bump for that
[22:13:53] Krenair: yeah ouch.
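[Editor's note: the operational difference mentioned at 22:02:59 can be sketched as below. This is a hedged illustration, not from the log: it assumes the new standalone puppetmaster serves requests from inside Apache (commonly via Passenger on puppetmasters of this vintage), so there is no longer a separate puppetmaster service to restart.]

```shell
# Old self-hosted setup (role::puppet::self): the master was its own service.
#   sudo service puppetmaster restart
# New standalone puppetmaster: the master runs under Apache (assumed:
# via Passenger), so "restarting the puppetmaster" is an Apache restart.
sudo service apache2 restart
```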
[22:13:54] ok
[22:13:58] Krenair: I wrote up docs https://wikitech.wikimedia.org/wiki/Standalone_puppetmaster
[22:14:21] I mean, we could right now
[22:14:22] legoktm: another thing I just realized is that there is now a manual step when you set up a new instance
[22:14:31] you need to do 'rm -rf /var/lib/puppet/ssl'
[22:14:32] But then I'd be using quota space allocated for something entirely different
[22:14:45] yuvipanda: hmm, we should document this somewhere
[22:14:46] And we probably can't just terminate a puppetmaster without leaving it shutdown for a week first
[22:14:50] Krenair: well, if we can delete the old puppetmaster by the end
[22:14:54] legoktm: https://wikitech.wikimedia.org/wiki/Standalone_puppetmaster
[22:14:58] oh xD
[22:25:04] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T143328#2650712 (10Matanya)
[22:34:27] RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:36:38] 06Release-Engineering-Team, 06Editing-Department, 10Monitoring, 06Operations, 07Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2650309 (10Tgr) We might want separate api and non-api metrics since they have differ...
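[Editor's note: the manual step yuvipanda describes at 22:14:22/22:14:31 can be sketched as below. Only the `rm -rf /var/lib/puppet/ssl` command comes from the log; the follow-up agent run is an assumption about how a fresh instance would obtain a new client cert from the project's standalone puppetmaster.]

```shell
# On a newly created instance: drop the SSL material generated against
# the labs puppetmaster (this is the step from the log)...
sudo rm -rf /var/lib/puppet/ssl
# ...then (assumed) re-run the agent so it requests a fresh client cert
# from the project's standalone puppetmaster.
sudo puppet agent --test
```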
[22:40:23] 06Release-Engineering-Team, 10Monitoring, 06Operations, 13Patch-For-Review, 07Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2650747 (10greg)
[22:43:55] 06Release-Engineering-Team, 10Monitoring, 06Operations, 07Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2650768 (10greg)
[23:01:16] 06Release-Engineering-Team (Deployment-Blockers), 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 07Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2650834
[23:46:18] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T143328#2564761 (10thcipriani) I briefly moved group0 to wmf.19 and saw a large spike in the overall fatalmonitor error-rate. I realized that the appserver...
[23:46:36] 06Release-Engineering-Team (Deployment-Blockers), 13Patch-For-Review, 05Release: MW-1.28.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T143328#2651007 (10thcipriani) a:05hashar>03thcipriani
[23:57:28] 06Release-Engineering-Team (Deployment-Blockers), 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 07Beta-Cluster-reproducible, and 7 others: Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesi... - https://phabricator.wikimedia.org/T145819#2651025