[01:27:39] mobrovac, bd808: Restored the cherry-picks on the puppet repo
[01:27:56] using git reflog, thanks for that tip Bryan
[01:28:04] specifically git reflog | grep -C50 "reset: moving to origin/production"
[01:31:05] Krenair: awesome. thanks for taking care of that
[01:40:55] RECOVERY - Puppet run on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:47:36] RECOVERY - Puppet run on deployment-db04 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:52:30] RECOVERY - Puppet run on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:53:18] RECOVERY - Puppet run on deployment-restbase02 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:55:48] looks like there's some other stuff to deal with too
[01:56:02] like things relying on lvs config
[01:57:57] RECOVERY - Puppet run on deployment-pdfrender is OK: OK: Less than 1.00% above the threshold [0.0]
[01:59:25] PROBLEM - Puppet run on deployment-conftool is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[01:59:41] bd808: have you seen https://www.python.org/dev/peps/pep-0479/ ? just re-ran the commit-message-validator tests and it triggered the deprecation warning for that
[02:00:11] (PS1) Legoktm: Release 0.4.1 [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/316510
[02:00:20] (CR) Legoktm: [C: 2] Release 0.4.1 [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/316510 (owner: Legoktm)
[02:00:32] RECOVERY - Puppet run on deployment-mira is OK: OK: Less than 1.00% above the threshold [0.0]
[02:00:37] PROBLEM - Puppet run on deployment-lvs-experiment is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[02:00:51] (Merged) jenkins-bot: Release 0.4.1 [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/316510 (owner: Legoktm)
[02:02:07] (got rid of my deployment-lvs- instances)
[02:02:29] RECOVERY - Puppet run on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:02:41] RECOVERY - Puppet run on deployment-mathoid is OK: OK: Less than 1.00% above the threshold [0.0]
[02:02:50] PROBLEM - Host deployment-lvs-realservertest is DOWN: CRITICAL - Host Unreachable (10.68.23.164)
[02:05:15] RECOVERY - Puppet run on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:06:07] RECOVERY - Puppet run on deployment-tin is OK: OK: Less than 1.00% above the threshold [0.0]
[02:07:02] silly xchat
[02:07:15] I'm a bit disappointed no one else noticed all the puppet issues around deployment-prep
[02:08:45] RECOVERY - Puppet run on deployment-sca03 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:09:52] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (demon) >>! In T148478#2724184, @Dzahn wrote: > On Oct 13 the log4j.properties were merged in https://gerrit.wikimedia.org/r/#/c/315571/ and there is no...
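Editor's note on the PEP 479 deprecation warning mentioned at 01:59:41: the log does not show the commit-message-validator code itself, so the following is only a minimal sketch of the general pattern PEP 479 affects. The function names `take_legacy` and `take_pep479_safe` are hypothetical and are not taken from the repository. Under PEP 479 (the default behaviour from Python 3.7), a `StopIteration` that escapes from inside a generator body is turned into a `RuntimeError`, so a generator should end with a plain `return` instead.

```python
def take_legacy(iterable, n):
    """Hypothetical pre-PEP 479 style: lets StopIteration leak out of next()."""
    it = iter(iterable)
    for _ in range(n):
        # If `it` runs dry, next() raises StopIteration inside the generator.
        # With PEP 479 semantics this surfaces as RuntimeError to the caller.
        yield next(it)


def take_pep479_safe(iterable, n):
    """Hypothetical PEP 479 compatible style: end the generator with return."""
    it = iter(iterable)
    for _ in range(n):
        try:
            item = next(it)
        except StopIteration:
            # Finish the generator cleanly instead of leaking the exception.
            return
        yield item


if __name__ == "__main__":
    print(list(take_pep479_safe([1, 2], 5)))
```

Both functions behave the same while the input has enough items; they differ only when the underlying iterator is exhausted early.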
[02:14:20] RECOVERY - Puppet staleness on deployment-ircd is OK: OK: Less than 1.00% above the threshold [3600.0]
[02:22:39] RECOVERY - Puppet run on deployment-apertium02 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:42:05] Krenair: The releng folks are at an offsite this week so that may explain them not noticing
[02:42:49] I saw you talking about it first thing when I got in irc so I figured it was in good hands :)
[02:55:11] (PS1) BryanDavis: Remove StopIteration for PEP 479 compatibility [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/316515
[03:22:00] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724455 (Dzahn) >>! In T148478#2724425, @demon wrote: > Ugh, then we need to remove that until we've got acceptable logging config, mea culpa. I made log4j log t...
[03:43:00] PROBLEM - Puppet run on repository is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[04:23:00] RECOVERY - Puppet run on repository is OK: OK: Less than 1.00% above the threshold [0.0]
[06:22:13] PROBLEM - Puppet run on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[06:47:22] PROBLEM - Puppet run on integration-slave-precise-1002 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[06:51:11] (CR) Legoktm: [C: 2] Remove StopIteration for PEP 479 compatibility [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/316515 (owner: BryanDavis)
[06:56:34] (Merged) jenkins-bot: Remove StopIteration for PEP 479 compatibility [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/316515 (owner: BryanDavis)
[07:02:12] RECOVERY - Puppet run on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:04:06] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[07:22:25] RECOVERY - Puppet run on integration-slave-precise-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:25:48] Continuous-Integration-Config, Operations, Operations-Software-Development: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2724601 (Volans)
[08:34:57] Continuous-Integration-Config, Operations, Operations-Software-Development: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2724636 (ema) p:Triage>Normal
[09:06:55] PROBLEM - Puppet run on deployment-db03 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[09:41:54] RECOVERY - Puppet run on deployment-db03 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:51:33] PROBLEM - Puppet run on deployment-apertium01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[09:52:42] Beta-Cluster-Infrastructure, Prometheus-metrics-monitoring: Prometheus puppet manifest fail on Trusty instance deployment-zotero1 groupadd: failure while writing changes to /etc/group - https://phabricator.wikimedia.org/T145793#2724706 (fgiunchedi) Resolved>Open Reopening since we're seeing the s...
[10:43:58] PROBLEM - Puppet run on repository is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[11:23:59] RECOVERY - Puppet run on repository is OK: OK: Less than 1.00% above the threshold [0.0]
[11:24:15] Hey, is everything okay to merge this? https://gerrit.wikimedia.org/r/#/c/316216/
[11:27:28] Beta-Cluster-Infrastructure, Prometheus-metrics-monitoring: Prometheus puppet manifest fail on Trusty instance deployment-zotero1 groupadd: failure while writing changes to /etc/group - https://phabricator.wikimedia.org/T145793#2724901 (akosiaris) That part fixed. Sorry, I forgot to do the upgrade to lin...
[11:44:16] Beta-Cluster-Infrastructure, Prometheus-metrics-monitoring: Prometheus puppet manifest fail on Trusty instance deployment-zotero1 groupadd: failure while writing changes to /etc/group - https://phabricator.wikimedia.org/T145793#2724909 (fgiunchedi) Open>Resolved all done! thanks @akosiaris, I've...
[11:53:14] PROBLEM - Puppet run on integration-slave-trusty-1012 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:54:47] Release-Engineering-Team, Documentation, Easy: Merge Wikimedia's "Deployment checklist for new extensions" doc pages - https://phabricator.wikimedia.org/T142081#2724931 (Aklapper) @Mayank.jindal5: Just checking: Anything we can help with? (Do you still plan to work on this?)
[11:55:17] Release-Engineering-Team, Documentation, Easy: Merge Wikimedia's "Deployment checklist for new extensions" doc pages - https://phabricator.wikimedia.org/T142081#2724932 (Aklapper) p:Triage>Low
[12:18:41] Release-Engineering-Team, Documentation, Easy: Merge Wikimedia's "Deployment checklist for new extensions" doc pages - https://phabricator.wikimedia.org/T142081#2724945 (Aklapper) * Writing_an_extension_for_deployment suffers from trying to cover very different things on one page: There's the generic...
[12:33:12] RECOVERY - Puppet run on integration-slave-trusty-1012 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:59:01] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2725033 (elukey) p:Triage>Normal
[13:07:04] Scap3, Services (watching), User-mobrovac: Scap3 fails to restart the service on deploy - https://phabricator.wikimedia.org/T148407#2725072 (mobrovac) Open>Invalid Sorry, wrong report. It turns out scap versions on `deployment-tin` and the target mismatched due to Puppet not being able to run...
[13:09:23] Continuous-Integration-Config, Front-end-Standards-Group: Consider moving from npm to yarn for WMF repos? - https://phabricator.wikimedia.org/T148230#2725076 (hashar) I have learned about yarn just last week from Nuria / Analytics team. Would you mind reaching out to them and check where they are heading...
[13:23:40] Beta-Cluster-Infrastructure: New wiki cluster wikipedia indonesian language - https://phabricator.wikimedia.org/T143557#2725106 (Mbrt) declined>Open
[14:26:56] Beta-Cluster-Infrastructure: New wiki cluster wikipedia indonesian language - https://phabricator.wikimedia.org/T143557#2725273 (Aklapper) Open>declined Please do not change the task status with arguments, as reasons have been provided why this request is declined. If something is still unclear, plea...
[14:46:36] PROBLEM - Puppet run on zuul-dev-jessie is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[14:54:30] Release-Engineering-Team, Operations, Beta-Cluster-reproducible, Patch-For-Review: mwscript on jessie mediawiki fails - https://phabricator.wikimedia.org/T146286#2725337 (elukey) p:Triage>Normal
[15:39:49] PROBLEM - Puppet run on deployment-phab02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[15:42:51] Yippee, build fixed!
[15:42:51] Project selenium-MobileFrontend » chrome,beta,Linux,contintLabsSlave && UbuntuTrusty build #197: FIXED in 20 min: https://integration.wikimedia.org/ci/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=contintLabsSlave%20&&%20UbuntuTrusty/197/
[15:51:04] PROBLEM - Puppet run on deployment-phab01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[16:28:49] PROBLEM - Puppet run on deployment-imagescaler01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[16:43:49] RECOVERY - Puppet run on deployment-imagescaler01 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:18:42] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726398 (Paladox) Gerrit became slow again on 18/10/2016 around 6:50pm to 7:10pm bst time.
[18:19:14] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726403 (Paladox)
[18:20:53] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (Paladox)
[18:28:59] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726435 (Dzahn) ack, it became slow but this time i did not restart the service and it was just working again after a little while. in the logs we have now, i se...
[18:31:03] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726455 (Dzahn) >>! In T148478#2726398, @Paladox wrote: > Gerrit became slow again on 18/10/2016 around 6:50pm to 7:10pm bst time. Let's talk in UTC time. Like t...
[18:31:49] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726457 (Paladox) The problem started on 17:06pm utc which was reported by @Andrew last report was 18:21 pm utc.
[18:33:22] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2726478 (Paladox) >>! In T148478#2726435, @Dzahn wrote: > ack, it became slow but this time i did not restart the service and it was just working again after a li...
[18:55:07] ostriches would something like https://github.com/cmoulliard/gerrit-create-adminuser-plugin/blob/master/change-project-config/config/log4j.properties give us more gerrit log info than currently?
[18:56:23] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726582 (Paladox)
[18:57:32] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (Paladox)
[18:57:44] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (Paladox)
[19:32:11] Deployment-Systems, Release-Engineering-Team, Operations: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726717 (demon)
[19:35:19] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726725 (Paladox) The error described by @dzahn is fixed here https://gerrit-review.googlesource.com/#/c/87435/ and will probably be released in gerr...
[19:36:25] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726726 (demon) That's also not what caused a slowdown. We've been suffering from that ever since we upgraded past 2.8.x
[19:36:39] Deployment-Systems, Release-Engineering-Team, Operations: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726593 (Dzahn) Restoring /srv/mediawiki-staging from Bacula backup, source mira, destination mira /srv/mediawiki-restore, job is current...
[19:40:05] paladox: Don't know, don't care today.
[19:40:27] Way too busy with real problems.
[19:40:53] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726747 (Paladox) @demon we could disable gc? And return to cron?
[19:40:57] Ok
[19:41:43] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726750 (demon) That doesn't even make sense. I'm talking about the JVM garbage collection.
[19:42:14] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726753 (Paladox) Oh, can't we disable that?
[19:42:31] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726754 (demon) No.
[19:46:38] Gerrit, Release-Engineering-Team, Operations: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2726768 (Paladox) Can we setup heap to catch the crash so that we get some type of log that it's jvm gc? http://stackoverflow.com/questions/35262175...
[20:09:05] Deployment-Systems, Release-Engineering-Team, Operations: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2726872 (Dereckson) Current status: @demon will need some time to ensure the staging folder and everything are sane again. Meanwhile, de...
[20:10:19] (CR) Andrew Bogott: [C: 1] Puppet doc now ignore /bin files [integration/config] - https://gerrit.wikimedia.org/r/309332 (https://phabricator.wikimedia.org/T143233) (owner: Hashar)
[20:14:59] PROBLEM - Puppet run on repository is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[20:18:18] (PS2) Andrew Bogott: Puppet doc now ignore /bin files [integration/config] - https://gerrit.wikimedia.org/r/309332 (https://phabricator.wikimedia.org/T143233) (owner: Hashar)
[20:32:24] we have an issue with matching phabricator milestone tags with the wikibugs irc bot
[20:32:42] is there anybody here who dealt with that issue before / has pointers?
[20:33:33] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:36:11] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:49:59] RECOVERY - Puppet run on repository is OK: OK: Less than 1.00% above the threshold [0.0]
[21:10:31] gwicke: what's the issue?
[21:10:39] gwicke: or you can file a bug in #wikibugs
[21:15:58] PROBLEM - Puppet run on repository is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:29:01] (PS3) Paladox: Puppet doc now ignore /bin files [integration/config] - https://gerrit.wikimedia.org/r/309332 (https://phabricator.wikimedia.org/T143233) (owner: Hashar)
[21:30:20] (CR) Paladox: "Rebased." [integration/config] - https://gerrit.wikimedia.org/r/309332 (https://phabricator.wikimedia.org/T143233) (owner: Hashar)
[21:30:22] (CR) Paladox: [C: 1] Puppet doc now ignore /bin files [integration/config] - https://gerrit.wikimedia.org/r/309332 (https://phabricator.wikimedia.org/T143233) (owner: Hashar)
[21:44:56] mobrovac_: i'm on node 0.10 now in my vagrant. i hear you committed a fix to bring it back to 4.x, but i can't figure out how to get it to update
[21:47:44] oh wait, provisioning probably looks at those dependencies..
[21:49:59] ah bingo
[21:50:59] RECOVERY - Puppet run on repository is OK: OK: Less than 1.00% above the threshold [0.0]
[22:06:43] Beta-Cluster-Infrastructure, Community-Tech, MediaWiki-extensions-CentralAuth: Add indices for local_user_id and global_user_id in Beta - https://phabricator.wikimedia.org/T148239#2727238 (kaldari)
[22:13:25] PROBLEM - Puppet run on integration-slave-precise-1002 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[22:46:32] Deployment-Systems, Release-Engineering-Team, Operations: Incomplete /srv/mediawiki-staging state on deployment servers - https://phabricator.wikimedia.org/T148571#2727314 (demon) Ok, tin/mira are back in use and look sane. Doing a final scap to get everything back in sync, but we should be good to g...
[22:53:25] RECOVERY - Puppet run on integration-slave-precise-1002 is OK: OK: Less than 1.00% above the threshold [0.0]