[00:01:11] (03Merged) 10jenkins-bot: Fix Reedy's changelog entries [integration/config] - 10https://gerrit.wikimedia.org/r/436963 (owner: 10Legoktm) [00:01:47] PROBLEM - Puppet errors on deployment-certcentral is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [00:02:55] (03CR) 10Reedy: [C: 031] Rebuild images for composer update [integration/config] - 10https://gerrit.wikimedia.org/r/436964 (owner: 10Legoktm) [00:04:10] (03CR) 10Legoktm: [C: 032] Rebuild images for composer update [integration/config] - 10https://gerrit.wikimedia.org/r/436964 (owner: 10Legoktm) [00:04:15] third time is the charm [00:05:52] 10Phabricator (Upstream), 10Upstream: Option to Turn Off Status Updates in Phabricator Task-Threads - https://phabricator.wikimedia.org/T195728#4250107 (10Johnywhy) >>! In T195728#4248365, @Aklapper wrote: > Please do not re-subscribe me to this task. That's an inappropriate request. If lots of people made th... [00:06:17] (03Merged) 10jenkins-bot: Rebuild images for composer update [integration/config] - 10https://gerrit.wikimedia.org/r/436964 (owner: 10Legoktm) [00:08:34] building now [00:12:13] 10Continuous-Integration-Infrastructure, 10MediaWiki-General-or-Unknown: Bump symfony libraries when we longer need hhvm support - https://phabricator.wikimedia.org/T196206#4250120 (10Reedy) [00:12:56] 10Continuous-Integration-Infrastructure, 10MediaWiki-General-or-Unknown: Bump symfony libraries when we longer need hhvm support - https://phabricator.wikimedia.org/T196206#4250132 (10Reedy) 05Open>03stalled p:05Triage>03Low [00:19:59] RECOVERY - Puppet errors on deployment-certcentral-testclient is OK: OK: Less than 1.00% above the threshold [0.0] [00:21:50] RECOVERY - Puppet errors on deployment-certcentral is OK: OK: Less than 1.00% above the threshold [0.0] [00:25:59] PROBLEM - Puppet errors on deployment-certcentral-testclient is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [00:27:19] (03PS1) 10Legoktm: Bump docker 
images for composer rebuild [integration/config] - 10https://gerrit.wikimedia.org/r/436966 [01:16:06] !log running docker-pkg in a screen because my connection is super flaky [01:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [01:24:10] PROBLEM - Puppet errors on deployment-certcentral-testclient02 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [01:34:10] RECOVERY - Puppet errors on deployment-certcentral-testclient02 is OK: OK: Less than 1.00% above the threshold [0.0] [01:47:30] legoktm: How's progress? [01:53:05] (03CR) 10Reedy: [C: 031] Bump docker images for composer rebuild [integration/config] - 10https://gerrit.wikimedia.org/r/436966 (owner: 10Legoktm) [01:55:09] PROBLEM - Host deployment-certcentral-testclient is DOWN: CRITICAL - Host Unreachable (10.68.23.197) [02:03:30] (03CR) 10Legoktm: [C: 032] build: Updating mediawiki/mediawiki-codesniffer to 20.0.0 [integration/docroot] - 10https://gerrit.wikimedia.org/r/436935 (owner: 10Libraryupgrader) [02:04:49] 10Gerrit, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183#4250214 (10Paladox) Bump :) [02:08:55] (03PS2) 10Legoktm: build: Updating mediawiki/mediawiki-codesniffer to 19.0.0 [integration/docroot] - 10https://gerrit.wikimedia.org/r/436935 (owner: 10Libraryupgrader) [02:09:01] (03CR) 10Legoktm: [C: 032] build: Updating mediawiki/mediawiki-codesniffer to 19.0.0 [integration/docroot] - 10https://gerrit.wikimedia.org/r/436935 (owner: 10Libraryupgrader) [02:09:20] (03CR) 10jerkins-bot: [V: 04-1] build: Updating mediawiki/mediawiki-codesniffer to 19.0.0 [integration/docroot] - 10https://gerrit.wikimedia.org/r/436935 (owner: 10Libraryupgrader) [02:09:37] (03CR) 10jenkins-bot: build: Updating mediawiki/mediawiki-codesniffer to 19.0.0 [integration/docroot] - 10https://gerrit.wikimedia.org/r/436935 (owner: 10Libraryupgrader) [02:09:49] (03CR) 10Legoktm: [C: 032] "..." 
[integration/docroot] - 10https://gerrit.wikimedia.org/r/436935 (owner: 10Libraryupgrader) [02:15:58] No rush, but just in case the composer issue did go out, I still see it failing at https://gerrit.wikimedia.org/r/#/c/436968/ [02:16:37] Not sure which updated jobs have been pushed again [02:18:05] Krinkle: quibble jobs are still being deployed [02:18:11] I don't know why, but it's ridiculously slow [02:20:10] PROBLEM - Puppet errors on deployment-certcentral-testclient02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [02:25:12] RECOVERY - Puppet errors on deployment-certcentral-testclient02 is OK: OK: Less than 1.00% above the threshold [0.0] [02:28:41] Krinkle: fixed now [02:43:26] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: operations-mw-config-composer-test-docker has composer version constraint regression - https://phabricator.wikimedia.org/T195688#4250220 (10Reedy) [03:07:46] PROBLEM - Puppet errors on deployment-certcentral is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0] [03:12:47] RECOVERY - Puppet errors on deployment-certcentral is OK: OK: Less than 1.00% above the threshold [0.0] [05:34:30] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<11.11%) [06:16:33] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<30.00%) [06:46:33] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK [07:29:28] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<33.33%) [09:39:29] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: 
integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<22.22%) [11:41:11] 10Phabricator (Upstream), 10Upstream: Option to Turn Off Status Updates in Phabricator Task-Threads - https://phabricator.wikimedia.org/T195728#4250482 (10Aklapper) [11:41:43] 10Phabricator (Upstream), 10Upstream: Option to Turn Off Status Updates in Phabricator Task-Threads - https://phabricator.wikimedia.org/T195728#4235323 (10Aklapper) 05Open>03declined No plans to implement an option to turn off status updates in task threads (which is what is requested in this task), hence... [12:09:27] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<33.33%) [13:09:26] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<11.11%) [13:44:49] (03CR) 10MGChecker: Add possibility to change allowed prefixes (031 comment) [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/436633 (https://phabricator.wikimedia.org/T191812) (owner: 10MGChecker) [13:50:15] (03PS8) 10MGChecker: Add possibility to change allowed prefixes [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/436633 (https://phabricator.wikimedia.org/T191812) [13:50:57] (03CR) 10jerkins-bot: [V: 04-1] Add possibility to change allowed prefixes [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/436633 (https://phabricator.wikimedia.org/T191812) (owner: 10MGChecker) [13:52:38] (03PS9) 10MGChecker: Add possibility to change allowed prefixes [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/436633 (https://phabricator.wikimedia.org/T191812) [13:53:25] (03CR) 10jerkins-bot: [V: 04-1] Add possibility to change allowed prefixes [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/436633 (https://phabricator.wikimedia.org/T191812) (owner: 10MGChecker) [13:54:56] (03PS10) 10MGChecker: Add 
possibility to change allowed prefixes [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/436633 (https://phabricator.wikimedia.org/T191812) [13:55:36] (03CR) 10jerkins-bot: [V: 04-1] Add possibility to change allowed prefixes [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/436633 (https://phabricator.wikimedia.org/T191812) (owner: 10MGChecker) [13:59:26] (03PS11) 10MGChecker: Add possibility to change allowed prefixes [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/436633 (https://phabricator.wikimedia.org/T191812) [14:00:27] (03CR) 10jerkins-bot: [V: 04-1] Add possibility to change allowed prefixes [tools/codesniffer] - 10https://gerrit.wikimedia.org/r/436633 (https://phabricator.wikimedia.org/T191812) (owner: 10MGChecker) [14:23:00] 10Beta-Cluster-Infrastructure, 10Puppet, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4250625 (10Krenair) [14:23:04] 10Beta-Cluster-Infrastructure, 10Operations, 10media-storage, 10Patch-For-Review, 10Puppet: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#4250623 (10Krenair) 05Resolved>03Open cherry-picked, not merged [14:25:31] 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Move puppetmaster to Stretch - https://phabricator.wikimedia.org/T195686#4250627 (10Krenair) [14:25:47] 10Beta-Cluster-Infrastructure: Move puppetmaster to Stretch - https://phabricator.wikimedia.org/T195686#4234225 (10Krenair) [14:27:40] PROBLEM - Puppet errors on deployment-certcentral02 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [0.0] [14:37:08] (03CR) 10Reedy: [C: 031] "Shouldn't this have been merged? 
;)" [integration/config] - 10https://gerrit.wikimedia.org/r/436966 (owner: 10Legoktm) [14:41:10] PROBLEM - Puppet errors on deployment-certcentral-testclient02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:47:40] RECOVERY - Puppet errors on deployment-certcentral02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:09:39] 10Release-Engineering-Team (Watching / External), 10Operations, 10Patch-For-Review: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4250679 (10Dzahn) [15:23:52] 10MediaWiki-Codesniffer, 10MediaWiki-extensions-Variables, 10Patch-For-Review: Allow configuring MediaWiki.NamingConventions.ValidGlobalName.wgPrefix to allow additional prefixes - https://phabricator.wikimedia.org/T191812#4250697 (10MGChecker) Am I allowed to change the name from wgPrefix to allowedPrefix?... [15:27:44] PROBLEM - Puppet errors on deployment-redis01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:33:43] PROBLEM - Puppet errors on deployment-redis02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:39:29] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<44.44%) [15:43:11] 10Phabricator, 10Developer-Relations (Apr-Jun-2018), 10Patch-For-Review: Try to identify new developers (via assignee field) in Phab tasks and potentially follow up - https://phabricator.wikimedia.org/T195780#4250723 (10Aklapper) Querying the //complete// transaction table and counting for each person how of... 
[17:04:27] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<11.11%) [17:17:43] PROBLEM - Host deployment-certcentral is DOWN: CRITICAL - Host Unreachable (10.68.18.193) [17:21:08] RECOVERY - Puppet errors on deployment-certcentral-testclient02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:27:11] PROBLEM - Puppet errors on deployment-certcentral-testclient02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:37:08] RECOVERY - Puppet errors on deployment-certcentral-testclient02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:37:55] (03CR) 10Legoktm: [C: 032] "Yep" [integration/config] - 10https://gerrit.wikimedia.org/r/436966 (owner: 10Legoktm) [17:39:41] (03Merged) 10jenkins-bot: Bump docker images for composer rebuild [integration/config] - 10https://gerrit.wikimedia.org/r/436966 (owner: 10Legoktm) [18:07:11] !log Beta Cluster's RESTBase or Parsoid is broken. Saving VE times out, logstash-beta contain restbase: "internal_http_error" / "Error: ESOCKETTIMEDOUT" [18:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [18:12:41] PROBLEM - Puppet errors on deployment-deploy-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [18:45:13] PROBLEM - Puppet errors on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [18:48:09] hm [18:50:02] Krinkle, where do I find this log entry on fluorine? 
[18:50:56] Krenair: It's type:restbase, tags: es, gelf, normalized_message_untrimmed, host:deployment-restbase01 [18:51:00] so probably not on fluorine [18:51:06] goes from restbase to es/logstash directly afaik [18:51:13] password for beta logstash changed, no longer LDAP [18:51:27] it's in deployment-tin:/root/secrets.txt [18:52:07] message:"504: internal_http_error" [18:52:22] about 50x/min [18:52:36] jeez, people were sending their real LDAP passwords into labs? [18:52:54] https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/User%3AKrenair%2FSandbox?redirect=false is failing [18:52:59] backend be_deployment_restbase01_deployment_prep_eqiad_wmflabs { [18:52:59] .host = "deployment-restbase01.deployment-prep.eqiad.wmflabs"; [18:52:59] .port = "7231"; [18:53:46] I can curl it locally [18:54:05] (From the restbase01 host) [18:54:15] in fact I can curl it from cache-text04 too [18:55:30] ah [18:55:30] this is taking time [18:55:36] krenair@deployment-restbase01:~$ curl http://localhost:7231/en.wikipedia.beta.wmflabs.org/v1/page/html/User%3AKrenair%2FSandbox [18:57:28] alright [18:57:34] deployment-restbase01:/srv/log/restbase/main.log [18:58:42] so RB itself is erroring trying to get... 
something from somewhere [18:59:26] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<33.33%) [18:59:53] res.body.uri is "http://deployment-parsoid09.deployment-prep.eqiad.wmflabs:8000/en.wikipedia.beta.wmflabs.org/v3/page/pagebundle/User%3AKrenair%2FSandbox/268834" [19:02:07] parsoid has some requests going through [19:02:21] but nothing in deployment-parsoid09:/srv/log/parsoid/main.log about pagebundle [19:06:02] in fact trying to curl that URL from inside deployment-parsoid09 is proving slow [19:06:48] parsoid is clearly up [19:07:09] so this'll be something wrong with parsoid or maybe its connection to the wikis [19:10:46] https://phabricator.wikimedia.org/P7204 [19:17:35] nothing is appearing in /srv/log/parsoid/main.log about these requests when I run them [19:19:28] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<11.11%) [19:29:59] basically any request to parsoid other than GET / fails [19:30:04] it just sits there [19:31:01] !log restarted parsoid on deployment-parsoid09 to try to fix stuff [19:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [19:31:57] Krinkle, that appears to have done the trick [19:32:16] * Krinkle is writing an epic task unrelated to this [19:32:41] Krenair: Cool, yeah, save is working now [19:32:42] Yay! [19:32:44] thanks! [19:33:16] Anything in the server's local logs leading up to the problem? [19:34:21] no [19:34:30] except unrelated stuff [19:35:00] requests relating to ChangePropagation and Mobile-Content-Service were going through it seemed, for some reason [19:36:16] e.g.
[19:36:29] {"name":"parsoid","hostname":"deployment-parsoid09","pid":24,"level":30,"logType":"info","wiki":"enwiki","title":"Kenya_national_football_team","oldId":118612,"reqId":"1335903d-70e8-4dbf-9612-04d3dd5d96e6","userAgent":"ChangePropagation/WMF","msg":"completed wt2html in 2320ms","longMsg":"completed wt2html in 2320ms","levelPath":"info","time":"2018-06-02T19:30:24.991Z","v":0} [19:36:29] {"name":"parsoid","hostname":"deployment-parsoid09","pid":24,"level":30,"logType":"info","wiki":"enwiki","title":"Dennis_Oliech","oldId":null,"reqId":"1335903d-70e8-4dbf-9612-04d3dd5d96e6","userAgent":"ChangePropagation/WMF","msg":"started wt2html","longMsg":"started wt2html","levelPath":"info","time":"2018-06-02T19:30:25.016Z","v":0} [19:36:29] {"name":"parsoid","hostname":"deployment-parsoid09","pid":2,"level":30,"levelPath":"info/service-runner/master","msg":"master shutting down, killing workers","time":"2018-06-02T19:30:25.259Z","v":0} [19:37:17] there didn't appear to be any logs going in relating to my requests from curl [19:38:36] I did notice that 'sudo service parsoid status' was complaining about needing a journal reload or something [19:57:49] !log gjg@integration-slave-docker-1003:/srv/jenkins-workspace/workspace$ sudo rm -rf * [19:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [19:59:28] PROBLEM - Free space - all mounts on integration-slave-docker-1003 is CRITICAL: CRITICAL: integration.integration-slave-docker-1003.diskspace.root.byte_percentfree (<11.11%) [20:17:09] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: operations-mw-config-composer-test-docker has composer version constraint regression - https://phabricator.wikimedia.org/T195688#4251006 (10Reedy) 05Open>03Resolved a:03Reedy Yay [20:25:40] not sure why shinken hasn't caught up yet, there's only 4% usage on that /srv partition [20:25:42] * greg-g goes afk [21:06:02] greg-g, it's the root partition [21:06:10] /dev/vda3 19G 17G 1.1G 
95% / [21:07:41] 15G /var [21:08:07] 15G /var/lib [21:08:20] 15G /var/lib/docker [21:08:40] 15G /var/lib/docker/overlay2 [21:09:55] there are 31 files in there that are hundreds of megabytes in size [21:10:01] but I don't know anything about docker so [21:10:12] legoktm might know [21:11:56] that's docker's default path [21:12:07] I think they were talking about it yesterday. [21:12:17] maybe symlinking it to /srv would work? [21:18:00] hi [21:18:05] which instance? [21:18:11] Krenair, greg-g [21:18:22] integration-slave-docker-1003 [21:18:27] looking [21:18:30] it's not urgent [21:18:35] AFAIK [21:18:39] just shinken moaning [21:20:20] !log legoktm@integration-slave-docker-1003:~$ sudo docker rmi $(sudo docker images -q) [21:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [21:21:09] I just deleted all the local docker images [21:21:21] if they're still used, it'll pull them in the next run [21:22:24] legoktm Krenair: aha, thanks! [21:29:30] RECOVERY - Free space - all mounts on integration-slave-docker-1003 is OK: OK: All targets OK [21:30:30] oh so that's just a local cache of stuff, a lot of which is likely to be outdated? [21:30:40] would there be any value in some automatic cleaning script? [21:43:15] Krenair: yes, we should have an auto cleanup script [21:43:50] I wrote one-ish in Python yesterday but didn't save it [21:44:30] it would just run `docker images` to get the list of installed images, figure out which images are outdated (if there's a newer tag), and then run `docker rmi ...` for those [22:01:42] other high disk usage cases in that project [22:02:29] https://phabricator.wikimedia.org/P7205 [22:19:18] Why does deployment-mediawiki-07 have a /mnt/mediawiki? [22:19:25] It is taking up 7G [22:30:27] PROBLEM - Host deployment-puppetmaster02 is DOWN: CRITICAL - Host Unreachable (10.68.21.200)
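
Editor's note on the auto cleanup script described at [21:44:30]: the script mentioned in the log was never saved, so what follows is only a minimal sketch of that idea, not the original. It assumes that tags within one repository can be ordered by the integers they contain (for example `0.2.0` is newer than `0.1.0`); tags with no digits, such as `latest`, are left alone. The function names `tag_key`, `outdated_images`, `list_local_images` and `cleanup` are hypothetical, as are the repository names in the usage note.

```python
import re
import subprocess
from collections import defaultdict


def tag_key(tag):
    """Sort key for a tag: the tuple of integers it contains.

    Assumption: tags look like version numbers ("0.3.0", "0.3.0-1").
    """
    return tuple(int(n) for n in re.findall(r"\d+", tag))


def outdated_images(rows):
    """Given (repository, tag) pairs, return "repo:tag" refs that have a newer tag."""
    by_repo = defaultdict(list)
    for repo, tag in rows:
        # Dangling images and tags without digits cannot be ordered; skip them.
        if tag == "<none>" or not tag_key(tag):
            continue
        by_repo[repo].append(tag)

    stale = []
    for repo, tags in by_repo.items():
        newest = max(tag_key(t) for t in tags)
        stale.extend(f"{repo}:{t}" for t in tags if tag_key(t) < newest)
    return stale


def list_local_images():
    """Return (repository, tag) pairs for the images known to the local docker daemon."""
    out = subprocess.run(
        ["docker", "images", "--format", "{{.Repository}} {{.Tag}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [tuple(line.split()) for line in out.splitlines() if line.strip()]


def cleanup(delete=False):
    """Print outdated images; remove them with `docker rmi` only if delete=True."""
    stale = outdated_images(list_local_images())
    for ref in stale:
        print(ref)
    if stale and delete:
        subprocess.run(["docker", "rmi", *stale], check=True)
    return stale
```

Calling `outdated_images([("example/ci", "0.1.0"), ("example/ci", "0.2.0")])` reports only `example/ci:0.1.0`. Running `cleanup()` without arguments is a dry run that only lists candidates, which seems the safer default on a CI slave where a job may still be using an older image.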