[00:12:36] <wikibugs>	 Beta-Cluster-Infrastructure, MediaWiki-DjVu: DjVu rendering broken on Beta - https://phabricator.wikimedia.org/T117132#1767552 (Bawolff) same issue with tiffs and missing libvips.
[01:07:02] <wikibugs>	 Beta-Cluster-Infrastructure: +Sysop for User:Mww113 - https://phabricator.wikimedia.org/T116364#1767683 (Mww113) Thank you
[03:08:23] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<20.00%)
[05:51:29] <wikibugs>	 Release-Engineering-Team, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768017 (Tgr) >>! In T102566#1761834, @saper wrote: > Question: wouldn't that be possible to ship the certifi...
[06:00:37] <grrrit-wm>	 (PS1) Awight: Skip PHPUnit tests on all deployment branches [integration/config] - https://gerrit.wikimedia.org/r/249968 (https://phabricator.wikimedia.org/T117062)
[06:01:43] <wikibugs>	 Continuous-Integration-Config, Fundraising-Backlog, Unplanned-Sprint-Work, Fundraising Sprint William Shatner, Patch-For-Review: Tests on deployment branches of wikimedia/fundraising/crm falling causing to force merge (and deadlock of Zuul) - https://phabricator.wikimedia.org/T117062#1768025 (aw...
[06:38:21] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK
[09:06:33] <zeljkof>	 hashar: meeting reminder :)
[09:07:08] <hashar>	 zeljkof: ah I thought you were still sick sorry
[09:21:52] <hashar>	 net breaking :/
[09:30:37] <shinken-wm>	 PROBLEM - Puppet failure on deployment-fluorine is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[09:38:28] <wikibugs>	 Differential, Gerrit-Migration: Differential notification emails lack headers for repository and state change - https://phabricator.wikimedia.org/T117186#1768274 (hashar) NEW
[09:53:48] <wikibugs>	 Beta-Cluster-Infrastructure, MediaWiki-DjVu: DjVu rendering broken on Beta - https://phabricator.wikimedia.org/T117132#1768318 (hashar)
[09:53:50] <wikibugs>	 Beta-Cluster-Infrastructure: beta cluster missing vips command needed to render tiffs and pngs - https://phabricator.wikimedia.org/T116816#1768319 (hashar)
[09:57:26] <shinken-wm>	 PROBLEM - Host deployment-cache-parsoid04 is DOWN: CRITICAL - Host Unreachable (10.68.19.197)
[09:59:54] <wikibugs>	 Gerrit-Migration, Diffusion, Labs, Tool-Labs: Figure out a git hosting solution for tools - https://phabricator.wikimedia.org/T117071#1768338 (hashar)
[10:03:33] <wikibugs>	 Gerrit-Migration, Diffusion, Labs, Tool-Labs: Figure out a git hosting solution for tools - https://phabricator.wikimedia.org/T117071#1768348 (hashar) The Wikimedia git hosting solution is Gerrit.  The aim is to replace Gerrit with Differential, a solution which is embedded in Phabricator and track...
[10:18:37] <wikibugs>	 Continuous-Integration-Infrastructure, MediaWiki-Documentation, Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1768361 (hashar) Indeed we have a bunch of them scattered around in multiple extensions:  | OpenStackManager | Spyc.php | ht...
[10:20:40] <wikibugs>	 Release-Engineering-Team, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768363 (saper) Right, might be difficult due to the way OpenSSL usually wants to have certificates. Will try...
[10:20:44] <wikibugs>	 Continuous-Integration-Infrastructure, MediaWiki-Documentation, Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1768364 (hashar) @bd808 pecl extension is: http://bd808.com/pecl-file_formats-yaml/ which uses libyaml C bindings.
[10:21:35] <wikibugs>	 Continuous-Integration-Config, Fundraising-Backlog, Unplanned-Sprint-Work, Fundraising Sprint William Shatner, and 2 others: Tests on deployment branches of wikimedia/fundraising/crm falling causing to force merge (and deadlock of Zuul) - https://phabricator.wikimedia.org/T117062#1768365 (hashar)
[10:22:55] <grrrit-wm>	 (CR) Hashar: [C: 2] "We will see what happens :-)" [integration/config] - https://gerrit.wikimedia.org/r/249968 (https://phabricator.wikimedia.org/T117062) (owner: Awight)
[10:24:06] <grrrit-wm>	 (Merged) jenkins-bot: Skip PHPUnit tests on all deployment branches [integration/config] - https://gerrit.wikimedia.org/r/249968 (https://phabricator.wikimedia.org/T117062) (owner: Awight)
[10:37:33] <aharoni>	 zeljkof: hi
[10:41:44] <wikibugs>	 Continuous-Integration-Config, MediaWiki-General-or-Unknown, MediaWiki-extensions-General-or-Unknown: Add grunt-concurrent to mediawiki/core and decide whether to add it to extensions (Improve grunt performance) - https://phabricator.wikimedia.org/T116988#1768397 (hashar) The slowdown is usually due...
[10:53:33] <aharoni>	 hashar: https://phabricator.wikimedia.org/T95892
[10:53:37] <aharoni>	 is there anything left to do?
[10:54:02] <aharoni>	 Is ContentTranslation still different from other extensions in any regard? It should be the same.
[10:55:18] <hashar>	 aharoni: do you have a PHPUnit test that lints the YAML files?
[10:55:44] <aharoni>	 I don't think so. Where do I add it?
[10:55:50] <aharoni>	 How is it done in other extensions?
[10:56:18] <aharoni>	 It's .rubocop.yml, so it should be the same as all the other extensions that have .rubocop.yml.
[10:56:28] <aharoni>	 hashar: ^
[11:11:10] <wikibugs>	 Continuous-Integration-Infrastructure, Patch-For-Review: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1768437 (Amire80) [12:53] <aharoni> hashar: https://phabricator.wikimedia.org/T95892 [12:53] <aharoni> is there anything left to do? [12:54] <aharoni> Is ContentTranslatio...
[11:11:59] <wikibugs>	 Gerrit-Migration, Diffusion, Labs, Tool-Labs: Figure out a git hosting solution for tools/kubernetes - https://phabricator.wikimedia.org/T117071#1768440 (yuvipanda)
[11:15:23] <wikibugs>	 Gerrit-Migration, Diffusion, Labs, Tool-Labs: Figure out a git hosting solution for tools/kubernetes - https://phabricator.wikimedia.org/T117071#1768447 (yuvipanda) I think you're missing a lot of context here @hashar. This is a ticket that is gathering requirements for what sort of git setup we'd...
[11:27:34] <hashar>	 aharoni: sorry disconnected
[11:27:55] <aharoni>	 hashar: np
[11:27:58] <hashar>	 aharoni: rubocop is a style checker for the Ruby language. So it has nothing to do with linting YAML files :-D
[11:27:59] <aharoni>	 so do you have an example?
[11:28:05] <hashar>	 nop
[11:28:20] <hashar>	 if ContentTranslation has Spyc or some other YAML parser
[11:28:28] <aharoni>	 hashar: yes, but rubocop's own config file is a yaml file
[11:28:32] <hashar>	 you can write a PHPUnit test that looks for .yaml files and attempts to parse them
[11:28:38] <aharoni>	 is it checked by anything?
[11:28:41] <hashar>	 this way if the .yaml file is wrong, the test will break
[11:28:42] <aharoni>	 does it have to be checked?
[11:28:50] <hashar>	 I don't think
[11:28:58] <hashar>	 if it is invalid, I guess rubocop will fail :-}
[11:29:40] <hashar>	 hm
[11:29:49] <aharoni>	 hashar: so this task can probably be closed, because there's no other yaml to check in ContentTranslation, and I don't expect there to be any.
[11:29:58] <hashar>	 yeah 
[11:30:08] <hashar>	 I am not sure why we had a yamllint job on that repo :}
[11:30:26] <aharoni>	 maybe copy-and-paste
[11:30:28] <aharoni>	 ok thanks
[11:30:48] <hashar>	 yup :)
[11:31:20] <wikibugs>	 Continuous-Integration-Infrastructure, Patch-For-Review: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1768464 (hashar)
[11:31:22] <hashar>	 marked as resolved :)
[11:31:32] <hashar>	 aharoni: sorry for the mess
[11:31:43] <aharoni>	 hashar: no problem at all, thanks for the help.
[11:31:51] <wikibugs>	 Continuous-Integration-Infrastructure, Patch-For-Review: Phase out yamllint jobs (tracking) - https://phabricator.wikimedia.org/T95890#1768467 (Amire80)
[11:32:21] <wikibugs>	 Beta-Cluster-Infrastructure: beta cluster missing vips command needed to render tiffs and pngs, pnmtojpeg for DjVus - https://phabricator.wikimedia.org/T116816#1768469 (Rillke)
[11:35:51] <hashar>	 !log have varnish stats collector to emit to labmon1001.eqiad.wmnet  instead of production statsd ( cherry picked https://gerrit.wikimedia.org/r/#/c/249490/ )
[11:35:57] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[11:56:03] <jzerebecki>	 zeljkof: https://phabricator.wikimedia.org/T116164
[13:32:26] <wikibugs>	 MediaWiki-Codesniffer, Outreachy-Round-11: Outreachy proposal for : Improving static analysis tools for MediaWiki - https://phabricator.wikimedia.org/T115585#1768654 (01tonythomas) We are approaching the Outreachy'11 application deadline, and if you want to have your proposal considered to be part of this...
[13:33:27] <wikibugs>	 Release-Engineering-Team, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768664 (BBlack) It really shouldn't be hard for an OS/distribution/platform/language/whatever to have workin...
[13:51:52] <wikibugs>	 Continuous-Integration-Scaling, Tracking: Puppetize / play out with devpi - https://phabricator.wikimedia.org/T117207#1768741 (hashar) NEW
[14:00:21] <wikibugs>	 Continuous-Integration-Scaling, Patch-For-Review, Tracking: Puppetize / play out with devpi - https://phabricator.wikimedia.org/T117207#1768774 (hashar) Applied `role::ci::pmcache` to `pmcache.integration.eqiad.wmflabs`.
[14:06:58] <shinken-wm>	 PROBLEM - Puppet failure on pmcache is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[14:14:20] <wikibugs>	 Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Evaluate devpi for caching Pypi python packages - https://phabricator.wikimedia.org/T114871#1768810 (hashar) Integration being done via T117207
[14:16:50] <shinken-wm>	 RECOVERY - Puppet failure on pmcache is OK: OK: Less than 1.00% above the threshold [0.0]
[14:42:46] <grrrit-wm>	 (PS1) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871)
[14:47:37] <grrrit-wm>	 (PS2) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871)
[14:47:49] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[14:50:14] <grrrit-wm>	 (PS3) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871)
[14:53:11] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[14:54:21] <grrrit-wm>	 (CR) Hashar: "recheck" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[14:58:26] <grrrit-wm>	 (PS4) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871)
[14:59:57] <grrrit-wm>	 (CR) Hashar: "rebuild" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[15:00:08] * hashar whistles
[15:00:25] <grrrit-wm>	 (CR) Hashar: "recheck" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[15:00:39] * hashar feels tired
[15:01:18] * greg-g waves
[15:01:54] <grrrit-wm>	 (CR) Hashar: "recheck" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[15:09:27] <wikibugs>	 Release-Engineering-Team, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1768954 (Nemo_bis) >  This almost feels like trying to ship a bundled TCP/IP stack and set of ethernet hardwa...
[15:15:32] <grrrit-wm>	 (PS1) Hashar: rake-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018
[15:15:52] <grrrit-wm>	 (PS2) Hashar: tox-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018
[15:17:02] <grrrit-wm>	 (CR) Hashar: "recheck" [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T114871) (owner: Hashar)
[15:17:24] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] tox-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018 (owner: Hashar)
[15:17:43] <grrrit-wm>	 (PS3) Hashar: tox-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018
[15:20:24] <ostriches>	 thcipriani: post-checkout maybe?
[15:20:28] * ostriches ponders for a moment
[15:22:01] <shinken-wm>	 RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms
[15:23:34] <grrrit-wm>	 (CR) Hashar: [C: 2] "Helpful to find out potential slowdown of a build. For integration/config I have spotted:" [integration/config] - https://gerrit.wikimedia.org/r/250018 (owner: Hashar)
[15:24:08] <thcipriani>	 ostriches: not entirely sure. I think the key is that the first time this code is run, the same time we setup the git hook, the first git-related thing happening currently will be calling `git fetch` from a target.
[15:25:26] <grrrit-wm>	 (Merged) jenkins-bot: tox-jessie: archive tox env logs [integration/config] - https://gerrit.wikimedia.org/r/250018 (owner: Hashar)
[15:29:00] <ostriches>	 thcipriani: And the clients fetching are dumb?
[15:30:02] <thcipriani>	 yeah, happening over http, so my assumption is that the first thing git-related that happens currently is a GET request for .git/info/refs
[15:30:11] * ostriches nods
[15:37:28] <grrrit-wm>	 (PS5) Hashar: (WIP) use pmcache has a pypi cache [integration/config] - https://gerrit.wikimedia.org/r/250009 (https://phabricator.wikimedia.org/T117207)
[15:38:50] <wikibugs>	 Continuous-Integration-Scaling, Patch-For-Review, Tracking: Puppetize / play out with devpi - https://phabricator.wikimedia.org/T117207#1769069 (hashar) Playing with it at https://gerrit.wikimedia.org/r/#/c/250009/  Before running tox, we will need to export a couple variables, probably: ``` export PIP_...
[15:39:39] <wikibugs>	 Continuous-Integration-Scaling, Patch-For-Review, Tracking: Puppetize / play out with devpi - https://phabricator.wikimedia.org/T117207#1769072 (hashar) a:hashar
[15:41:06] <wikibugs>	 Continuous-Integration-Scaling: Package Zuul for Debian Jessie - https://phabricator.wikimedia.org/T117223#1769087 (hashar) NEW a:hashar
[15:42:35] <wikibugs>	 Continuous-Integration-Infrastructure, MediaWiki-Documentation, Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1769101 (bd808) >>! In T116965#1768364, @hashar wrote: > @bd808 pecl extension is: http://bd808.com/pecl-file_formats-yaml/...
[15:44:19] <wikibugs>	 Continuous-Integration-Infrastructure, MediaWiki-Documentation, Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1769106 (bd808) If we wanted to get really fancy we could make a Wikimedia library that wraps both symfony/yaml and the yaml...
[15:44:39] <hashar>	 bd808: you are lovely :-}
[15:44:46] <ostriches>	 thcipriani: Bahaha, I figured it out :p
[15:44:49] <hashar>	 wikimedia/ultimate-yaml
[15:45:18] <thcipriani>	 ostriches: ?
[15:45:21] <ostriches>	 Use post-update hook, create a remote "self" and push to yourself to trigger it :p
[15:45:30] <ostriches>	 Which seems totally unworth the labor compared to what we do now.
[15:45:42] <ostriches>	 There doesn't seem to be a post-tag hook of any sort.
[15:45:49] <ostriches>	 And we can't ensure a checkout has happened.
[15:46:07] <ostriches>	 (presumably it did, but it might not)
[15:48:00] <thcipriani>	 tagging would be nice thing to hooks to that I didn't think about, bummer that there's not a hook there.
[15:48:27] <thcipriani>	 s/hooks/hook/
[15:49:04] <ostriches>	 Yeah and post-commit doesn't run on tagging...I tried
[15:49:40] <thcipriani>	 hmm, reading the git-scm docs this seemed like the right thing to do, but maybe not for the weird thing we're trying to do.
[15:50:31] <ostriches>	 It assumes a push to a central prior to a pull really for that scenario
[15:51:43] <thcipriani>	 yeah, and presumably, there is a checkout that happens prior to a deploy, but in the first instance that would be pre our ability to set hooks.
[15:52:10] <ostriches>	 Yerp
[15:52:28] <ostriches>	 So, I think I'm gonna abandon D24 and its associated task. What we're doing now works, and we can't hook it better.
[15:53:57] <thcipriani>	 ostriches: kk, sounds right, thanks for looking into it.
[15:54:11] <ostriches>	 np
[15:55:09] <ostriches>	 Woulda been nice to avoid a shell out :)
[15:57:32] <thcipriani>	 indeed.
[16:28:40] <wikibugs>	 Beta-Cluster-Infrastructure, Collaboration-Team-Backlog, Flow, Collaboration-Team-Current: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries - https://phabricator.wikimedia.org/T78671#1769274 (matthiasmullie)
[16:39:59] <grrrit-wm>	 (PS2) JanZerebecki: [WIP] Run Wikidata browsertests without saucelabs [integration/config] - https://gerrit.wikimedia.org/r/247901 (https://phabricator.wikimedia.org/T116166)
[16:42:06] <wikibugs>	 Browser-Tests, Wikidata, Easy, Patch-For-Review: move wikidata browsertests to not use saucelabs - https://phabricator.wikimedia.org/T116166#1769311 (JanZerebecki) Yes that seems to disable saucelabs. But now I get "unable to obtain stable firefox connection in 60 seconds" as a failure for the seco...
[16:58:22] <shinken-wm>	 PROBLEM - Free space - all mounts on deployment-db2 is CRITICAL: CRITICAL: deployment-prep.deployment-db2.diskspace._mnt.byte_percentfree (<11.11%)
[17:00:53] <wikibugs>	 Browser-Tests, Wikidata, Easy, Patch-For-Review: move wikidata browsertests to not use saucelabs - https://phabricator.wikimedia.org/T116166#1769396 (zeljkofilipin) I think the only thing left is to tell the machine you want to run the browser headless.  ``` export HEADLESS=true ```  https://github...
[17:12:13] <wikibugs>	 Browser-Tests, Wikidata, Easy, Patch-For-Review: move wikidata browsertests to not use saucelabs - https://phabricator.wikimedia.org/T116166#1769439 (JanZerebecki) Yes that seems to work.
[17:13:01] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:13:07] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:15:13] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:15:46] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:15:46] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:28:05] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38610 bytes in 7.529 second response time
[17:30:03] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38731 bytes in 0.939 second response time
[17:30:35] <shinken-wm>	 RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30350 bytes in 0.459 second response time
[17:30:37] <shinken-wm>	 RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39021 bytes in 0.468 second response time
[17:32:53] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 38731 bytes in 0.705 second response time
[17:33:49] <grrrit-wm>	 (PS1) Ori.livneh: Update obsolete references to wikiversions.cdb [tools/scap] - https://gerrit.wikimedia.org/r/250043
[17:36:13] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:36:44] <shinken-wm>	 PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:36:44] <shinken-wm>	 PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:37:14] <shinken-wm>	 PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145)
[17:38:11] <greg-g>	 meh?
[17:39:04] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 8.430 second response time
[17:39:08] <shinken-wm>	 PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:39:40] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1769612 (Catrope) Looks like this is the same query as the one called out at {T78671}, which is the one I suspected.
[17:40:43] <wikibugs>	 Browser-Tests, Wikidata, Patch-For-Review, Wikidata-Sprint-2015-10-13: [Bug] fix negative argument (ArgumentError) in browsertests - https://phabricator.wikimedia.org/T110510#1769625 (zeljkofilipin) a:zeljkofilipin>None
[17:41:07] <grrrit-wm>	 (CR) Chad: [C: 2] Update obsolete references to wikiversions.cdb [tools/scap] - https://gerrit.wikimedia.org/r/250043 (owner: Ori.livneh)
[17:41:48] <grrrit-wm>	 (Merged) jenkins-bot: Update obsolete references to wikiversions.cdb [tools/scap] - https://gerrit.wikimedia.org/r/250043 (owner: Ori.livneh)
[17:42:58] <MatmaRex>	 is beta down? http://commons.wikimedia.beta.wmflabs.org/ doesn't load for me
[17:42:58] <MatmaRex>	 ---------------------------------------------------------------------------------------
[17:43:10] <MatmaRex>	 (whoops, bad paste)
[17:43:16] <Krenair>	 could've been worse :P
[17:43:25] <greg-g>	 something is flapping, shinken reported a problem, then a recovery, now another problem :)
[17:43:52] <Krenair>	 Apparently the DBs have too many connections?
[17:43:56] <Krenair>	 Original exception: [762b386d] / DBConnectionError from line 871 of /srv/mediawiki/php-master/includes/db/Database.php: DB connection error: Too many connections (10.68.17.94)
[17:44:00] <Krenair>	 when I load the page
[17:44:27] <Krenair>	 Grr.
[17:44:33] <Krenair>	 enwiki and commonswiki show different errors.
[17:44:34] <greg-g>	 https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor
[17:45:06] <Krenair>	 Well, same underlying error I presume.
[17:45:09] <Krenair>	 Just different error page formats
[17:45:26] <Krenair>	 How do you log in as mysql root again?
[17:45:50] <Krenair>	 sudo -i; myql
[17:45:53] <Krenair>	 mysql*
[17:46:19] <Krenair>	 root@deployment-db2:~# mysql
[17:46:19] <Krenair>	 ERROR 1040 (HY000): Too many connections
[17:50:25] <grrrit-wm>	 (CR) Zfilipin: [C: -2] "I disagree that VisualEditor browser tests were fine before we deleted the jobs. They were deleted because they were broken for a month or" [integration/config] - https://gerrit.wikimedia.org/r/247612 (https://phabricator.wikimedia.org/T94162) (owner: Jforrester)
[17:50:29] <greg-g>	 ori: https://phabricator.wikimedia.org/P2262
[17:52:39] <marxarelli>	 looks like another bad query filled up /mnt/tmp
[17:52:55] <greg-g>	 bah :/
[17:53:00] <marxarelli>	 we should get a `show full processlist` this time
[17:53:04] <marxarelli>	 somehow ...
[17:53:06] <marxarelli>	 :(
[17:53:13] <ori>	 db1 is not doing anything
[17:53:24] <ori>	 still can't get to db2
[17:53:45] <marxarelli>	 anyone comfortable doing this? https://www.percona.com/blog/2010/03/23/too-many-connections-no-problem/
[17:53:59] * marxarelli isn't 100% on which pid to connect to
[17:54:20] <ori>	 from just viewing the file in /mnt/tmp i'm pretty sure it's flow
[17:54:36] <ori>	 root@deployment-db2:/mnt/tmp# strings \#sql_7912_0.MAD | head
[17:54:36] <ori>	 20130917231900
[17:54:37] <ori>	 post-summarycreate-topic-summaryenwikienwikiSandboxdiscussionutf-8,gzip,html
[17:54:47] <marxarelli>	 it was last time
[17:54:50] <Krenair>	 I can't actually connect anymore?
[17:54:51] <ori>	 that's the 82G file
[17:55:00] <marxarelli>	 i'd really like to get a full view of the query this time
[17:55:08] <ori>	 i can do https://www.percona.com/blog/2010/03/23/too-many-connections-no-problem/
[17:55:08] <marxarelli>	 so they can debug the root cause
[17:55:16] <ori>	 hahahaha
[17:55:16] <marxarelli>	 ori: sweet. hit it
[17:55:17] <greg-g>	 thanks
[17:55:22] <Krenair>	 I tried that with no luck ori
[17:55:23] <ori>	 "Credit for the gdb magic goes to Domas."
[17:55:33] <Krenair>	 haha
[17:55:55] <ori>	 the pid file is /mnt/sqldata/deployment-db2.pid
[17:56:11] <Krenair>	 yeah, 30994
[17:56:20] <Krenair>	 I got the same thing with "ps -e | grep mysql"
[17:56:45] <Krenair>	 but, "Could not attach to process.  If your uid matches the uid of the target" etc.
[17:56:50] <Krenair>	 ptrace: Operation not permitted.
[17:56:52] <Krenair>	 despite being root
[17:57:02] <marxarelli>	 try with sudo -u mysql?
[17:57:19] <Krenair>	 that didn't work when I tried it earlier
[17:58:06] <wmf-insecte>	 Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #708: FAILURE in 5.3 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/708/
[17:58:54] <Krenair>	 did it work for you ori?
[17:59:24] <ori>	 well, /proc/sys/kernel/yama/ptrace_scope is 0
[17:59:34] <ori>	 the only other reason for getting this message is if someone is already attached to that process via gdb
[17:59:52] <ori>	 i see two gdb sessions open
[17:59:54] <ori>	 can i kill them?
[17:59:57] <ori>	 root      2042  0.1  0.1  64440 23000 pts/1    T    17:51   0:00 gdb -p 30994 -ex show variables like 'max_connections';
[17:59:59] <ori>	 root      2058  0.0  0.0  49016  8968 pts/1    T    17:52   0:00 gdb -p 30994 -ex set max_connections = 300
[18:00:09] * ori does
[18:00:12] <Krenair>	 yep
[18:00:18] <Krenair>	 the first one didn't work and I thought I had killed it
[18:00:20] <Krenair>	 apparently not
[18:02:08] <ori>	 got it
[18:04:19] <ori>	 i got the output of show full processlist but it contains an ip, what's the right perm in phab to ensure you guys can see it but not everyone else?
[18:05:08] <Krenair>	 just dump it on the labs instance itself?
[18:05:14] <greg-g>	 do nda for now, but beta is on labs :)
[18:05:52] <ori>	 sanitized: SELECT /* Flow\Formatter\ContributionsQuery::queryRevisions X.X.X.X */  *  FROM `flow_revision`,`flow_workflow`,`flow_tree_node`   WHERE rev_user_id = '820' AND rev_user_ip IS NULL AND rev_user_wiki = 'enwiki' AND (rev_id < 'G��p\0\0\0\0\0\0') AND workflow_wiki = 'enwiki' AND (tree_descendant_id = rev_type_id) AND rev_type = 'post-summary'  ORDER BY rev_id DESC LIMIT 51
[18:06:14] <marxarelli>	 and let's open a task for moving tmpdir to its own volume/partition
[18:06:37] <shinken-wm>	 RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 39002 bytes in 1.891 second response time
[18:06:37] <shinken-wm>	 RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 30351 bytes in 2.432 second response time
[18:06:37] <Krenair>	 is that one being spammed by the same IP ori?
[18:07:29] <ori>	 it's just one query
[18:07:38] <ori>	 there are 300 checks for slave lag queued up behind it
[18:07:42] <ori>	 but it is the cause
[18:07:45] <greg-g>	 ori: https://phabricator.wikimedia.org/T78671#850799
[18:07:51] <ori>	 on db1 as /tmp/show_full_processlist.db1.1446228429
[18:08:02] <greg-g>	 "·
[18:08:05] <greg-g>	 Dec 16 2014, 16:17"
[18:08:28] <ori>	 i killed the query so it recovered
[18:08:35] <marxarelli>	 ori: thanks!
[18:08:53] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 38713 bytes in 0.805 second response time
[18:08:53] <ori>	 np
[18:08:59] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 38707 bytes in 1.357 second response time
[18:09:16] <marxarelli>	 ori: so, for future reference, the gdb thing worked? and you connected to the mysqld process?
[18:09:44] <wikibugs>	 Beta-Cluster-Infrastructure, Collaboration-Team-Backlog, Flow, Collaboration-Team-Current: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries - https://phabricator.wikimedia.org/T78671#1769830 (Catrope) Updated explain result of that query: ``` +------+-------------+...
[18:10:12] <ori>	 marxarelli: yep https://dpaste.de/AWxG/raw
[18:10:35] <marxarelli>	 ori: awesome. good to know
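For future reference, the gdb approach discussed above raises `max_connections` inside the running mysqld without needing a free connection slot. Below is a minimal Python sketch of assembling that invocation, assuming the `gdb -p PID -ex "set max_connections=N" -batch` form from the linked Percona post. The helper names `read_pid` and `build_gdb_argv` are hypothetical, and the command is only built here, not executed.

```python
from pathlib import Path
from typing import List


def read_pid(pid_file: str) -> int:
    """Read the mysqld PID from a pid file (e.g. /mnt/sqldata/deployment-db2.pid)."""
    return int(Path(pid_file).read_text().strip())


def build_gdb_argv(pid: int, max_connections: int) -> List[str]:
    """Build (but do not run) the gdb command that patches max_connections.

    Assumption: the invocation shape follows the Percona post linked in the log.
    """
    if pid <= 0 or max_connections <= 0:
        raise ValueError("pid and max_connections must be positive")
    return [
        "gdb",
        "-p", str(pid),
        "-ex", f"set max_connections={max_connections}",
        "-batch",  # detach and exit once the expression has been evaluated
    ]


if __name__ == "__main__":
    # PID and limit taken from the session above; run the result as root.
    print(" ".join(build_gdb_argv(30994, 300)))
```

Only one debugger can attach to a process at a time, so a stopped gdb session left over from an earlier attempt has to be killed first, which is what blocked the attach here.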
[18:11:01] <marxarelli>	 looks like slave replication is broken now ...
[18:11:05] <shinken-wm>	 RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 38714 bytes in 0.968 second response time
[18:11:07] <marxarelli>	 the relay log is probably corrupt
[18:12:00] <ori>	 marxarelli: in fact since the process is still running probably good to set max connections to whatever it is normally
[18:12:06] <marxarelli>	 might be able to fix it with a slave reset
[18:12:31] <marxarelli>	 ori: it's normally 250
[18:12:36] <marxarelli>	 which seems fairly low
[18:12:55] <marxarelli>	 but yeah, i'll set it back
[18:13:06] <ori>	 no idea what a good value would be
[18:13:14] <ori>	 not something i know much about
[18:13:37] <marxarelli>	 usually just app processes + other processes + some change
[18:15:02] <marxarelli>	 i don't know enough about mw architecture to make a good estimate but 500 would still be conservative
[18:15:21] <ori>	 aaron or jynus would probably know
[18:15:26] <marxarelli>	 at nanowrimo i think we did ~ 2000
[18:15:43] * ori parts to keep set of irc channels manageable *wave*
[18:15:44] <marxarelli>	 but that was easier to compute
[18:15:49] <marxarelli>	 ori: :)
[18:15:51] <marxarelli>	 see ya
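The sizing rule of thumb above ("app processes + other processes + some change") can be sketched as a small Python helper. The function name and every number in the example are hypothetical; the real worker counts for the beta app servers are not stated in the log.

```python
def estimate_max_connections(app_processes: int,
                             other_processes: int,
                             headroom: int) -> int:
    """Sum the expected clients plus a safety margin.

    app_processes:   worker processes that may each hold a DB connection
    other_processes: monitoring, replication, maintenance scripts, etc.
    headroom:        the "some change" left over for admin logins
    """
    if min(app_processes, other_processes, headroom) < 0:
        raise ValueError("all inputs must be non-negative")
    return app_processes + other_processes + headroom


if __name__ == "__main__":
    # Hypothetical: 3 app servers with 100 workers each, 50 other clients,
    # and 50 spare slots.
    print(estimate_max_connections(3 * 100, 50, 50))
```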
[18:16:41] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1769852 (Catrope) >>! In T116447#1769612, @Catrope wrote: > Looks like this is the same query as the one called out at {T78671}, whic...
[18:17:55] <marxarelli>	 !log stopping/resetting slave on deployment-db2 to fix replication after relay log corruption
[18:18:01] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:18:23] <shinken-wm>	 RECOVERY - Free space - all mounts on deployment-db2 is OK: OK: All targets OK
[18:19:58] <marxarelli>	 !log deployment-db2 replication recovered after slave stop/reset/set master position/start
[18:20:03] <qa-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL, Master
[18:20:35] <marxarelli>	 well good thing it was only the relay binlog this time
[18:39:37] <wikibugs>	 10Deployment-Systems, 3Scap3: Scap3's checks.yaml file should be optional - https://phabricator.wikimedia.org/T116204#1769990 (10dduvall) 5Open>3Resolved
[18:54:08] <wikibugs>	 10Deployment-Systems, 3Scap3: Scap3 should have idempotent deploys - https://phabricator.wikimedia.org/T109513#1770044 (10dduvall) 5Open>3Resolved
[19:07:14] <marxarelli>	 greg-g: looks like the query that took down db2 is the same one from last friday https://phabricator.wikimedia.org/T116447#1758398
[19:07:52] <greg-g>	 "good"
[19:08:02] <greg-g>	 at least we get more data now!
[19:08:53] <marxarelli>	 it's essentially the same with a different rev_id range
[19:08:56] <marxarelli>	 but yeah, more!
[19:19:16] <wikibugs>	 10Beta-Cluster-Infrastructure, 6Collaboration-Team-Backlog, 10Flow, 3Collaboration-Team-Current: Beta Cluster Special:Contributions lags by a long time and notes slow Flow queries - https://phabricator.wikimedia.org/T78671#1770170 (10dduvall) Another instance of this query took down Beta Cluster again toda...
[19:56:49] <wikibugs>	 6Release-Engineering-Team, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1770275 (10Tgr) >>! In T102566#1768664, @BBlack wrote: > I haven't followed the low-level details too hard, but...
[20:16:12] <wikibugs>	 10Continuous-Integration-Infrastructure, 10MediaWiki-Documentation, 7Documentation: Create a Jenkins check to verify hooks.yaml formatting - https://phabricator.wikimedia.org/T116965#1770298 (10Tgr) The problem there is that libYAML implements YAML 1.1 while Symfony implements a subset of YAML 1.2.  For the...
[20:51:15] <wikibugs>	 10Continuous-Integration-Config, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Validator, 10Wikidata, 5Patch-For-Review: [Task] make core wmf branches only use submodule branches that run with it in CI - https://phabricator.wikimedia.org/T113731#1770348 (10mmodell) But I thought that media...
[21:11:20] <wmf-insecte>	 Yippee, build fixed!
[21:11:21] <wmf-insecte>	 Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #813: 09FIXED in 45 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/813/
[21:27:49] <wikibugs>	 10Continuous-Integration-Config, 5Continuous-Integration-Scaling: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770392 (10hashar) 3NEW a:3hashar
[21:30:41] <wikibugs>	 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 7Upstream: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770410 (10hashar) Our DIB manifest uses the `debian` element which injects: ``` $ cat /etc/cloud/cloud.cfg.d/01_h...
[21:46:35] <wikibugs>	 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 7Upstream: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770500 (10hashar) Upstream patch https://review.openstack.org/#/c/240614/
[21:55:53] <grrrit-wm>	 (03PS1) 10Hashar: nodepool: acquire hostname from cloud [integration/config] - 10https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283) 
[21:56:45] <grrrit-wm>	 (03CR) 10Hashar: "Need to rebuild the image ... https://wikitech.wikimedia.org/wiki/Nodepool#Diskimage" [integration/config] - 10https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283) (owner: 10Hashar)
[22:00:28] <wikibugs>	 6Release-Engineering-Team: Setup differential for https://github.com/wikimedia/composer-merge-plugin - https://phabricator.wikimedia.org/T117293#1770541 (10bd808) 3NEW
[22:00:50] <wikibugs>	 6Release-Engineering-Team, 15User-Bd808-Test: Setup differential for https://github.com/wikimedia/composer-merge-plugin - https://phabricator.wikimedia.org/T117293#1770550 (10bd808)
[22:20:28] <wikibugs>	 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 5Patch-For-Review, 7Upstream: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770603 (10hashar) Rebuilding an image: ``` dib-run-parts Fri Oct 30 22:19:39 UTC 2015 Runnin...
[22:20:40] <grrrit-wm>	 (03CR) 10Hashar: "Rebuilding an image:" [integration/config] - 10https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283) (owner: 10Hashar)
[22:21:19] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] "I haven't uploaded the new image. Cause it is Friday :)" [integration/config] - 10https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283) (owner: 10Hashar)
[22:22:20] <grrrit-wm>	 (03Merged) 10jenkins-bot: nodepool: acquire hostname from cloud [integration/config] - 10https://gerrit.wikimedia.org/r/250148 (https://phabricator.wikimedia.org/T117283) (owner: 10Hashar)
[22:22:32] <wikibugs>	 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 5Patch-For-Review, 7Upstream, 7WorkType-Maintenance: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770606 (10hashar)
[22:28:13] <wikibugs>	 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 5Patch-For-Review, 7Upstream, 7WorkType-Maintenance: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770621 (10hashar) The new image is on labnodepool1001.eqiad.wmnet...
[22:28:46] <wikibugs>	 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 5Patch-For-Review, 7Upstream, 7WorkType-Maintenance: Nodeopool instances do not acquire hostname from cloud (always: 'debian') - https://phabricator.wikimedia.org/T117283#1770623 (10hashar) p:5Triage>3Normal
[22:32:34] <wikibugs>	 5Gerrit-Migration, 6Release-Engineering-Team, 15User-Bd808-Test: Setup differential for https://github.com/wikimedia/composer-merge-plugin - https://phabricator.wikimedia.org/T117293#1770541 (10hashar) I would love to get @bd808 involved in Differential as an early adopter. Might prove useful to gather feedb...
[22:48:17] <wikibugs>	 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1770662 (10hashar)
[22:49:37] <wikibugs>	 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1749727 (10hashar) Pro tip: you can syntax highlight IRC logs in Markdown with `lang=irc`
[22:50:16] <wikibugs>	 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1770670 (10hashar) 5Resolved>3Open Reopening since we still follow up on determining the root cause of the outage.
[22:50:24] <wikibugs>	 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team: Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1770672 (10hashar) p:5Unbreak!>3High
[22:51:14] <wikibugs>	 10Beta-Cluster-Infrastructure, 7Tracking: Setup monitoring for Beta Cluster (tracking) - https://phabricator.wikimedia.org/T53497#1770676 (10hashar)
[22:51:15] <wikibugs>	 10Beta-Cluster-Infrastructure: Setup monitoring for database servers in beta cluster - https://phabricator.wikimedia.org/T87093#1770675 (10hashar) 5Open>3stalled
[22:51:31] <wikibugs>	 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 7Tracking: Use Beta cluster as a true canary for code deployments (tracking) - https://phabricator.wikimedia.org/T53494#1770680 (10hashar)
[22:51:32] <wikibugs>	 10Beta-Cluster-Infrastructure, 7Tracking: Setup monitoring for Beta Cluster (tracking) - https://phabricator.wikimedia.org/T53497#1770679 (10hashar) 5Open>3stalled
[22:52:07] <wikibugs>	 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team: [postmortem] Beta Cluster outage: deployment-db2 disk filled up, locked db replication - https://phabricator.wikimedia.org/T116447#1770681 (10hashar)
[23:56:40] <wikibugs>	 10Deployment-Systems, 6operations, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1770800 (10Dzahn) - moved "labsdbmanager"  to role keyword: https://gerrit.wikimedia.org/r/#/c/250083/ - moved base::firewall to the deployment-server role https://g...