[02:21:05] RECOVERY - Mathoid on deployment-mathoid is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.022 second response time [02:27:05] PROBLEM - Mathoid on deployment-mathoid is CRITICAL: connect to address 172.16.5.73 and port 10042: Connection refused [02:52:04] RECOVERY - Mathoid on deployment-mathoid is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.035 second response time [03:03:06] PROBLEM - Mathoid on deployment-mathoid is CRITICAL: connect to address 172.16.5.73 and port 10042: Connection refused [03:08:04] RECOVERY - Mathoid on deployment-mathoid is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.030 second response time [04:36:06] 10Beta-Cluster-Infrastructure, 10Editing-team, 10Release Pipeline, 10serviceops, and 3 others: Migrate Beta cluster services to use Kubernetes - https://phabricator.wikimedia.org/T220235 (10Joe) >>! In T220235#5171351, @Ottomata wrote: > An example of environmental differences: service-runner uses statsd.... [05:02:15] 10Beta-Cluster-Infrastructure, 10Editing-team, 10Release Pipeline, 10serviceops, and 3 others: Migrate Beta cluster services to use Kubernetes - https://phabricator.wikimedia.org/T220235 (10Joe) >>! In T220235#5174255, @Krinkle wrote: > The status quo is that services always run their code in beta before i... [05:20:55] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<40.00%) [05:25:06] PROBLEM - Mathoid on deployment-mathoid is CRITICAL: connect to address 172.16.5.73 and port 10042: Connection refused [06:17:57] Are there some crazy gotchas to working with the Gerrit API, by any chance? I just made the simplest of GET requests and the response is truncated in a way making it useless. https://gerrit.wikimedia.org/r/changes/I289f6a5c3a8696373d6693fed66d14dcfc3a3bd7/comments if anyone is curious. [06:18:26] It's actually truncated from the *beginning*, so I'm only getting the last half of a JSON document. 
[06:21:36] awight, you mean the way it begins with )]}' ? [06:24:36] Krenair: Yes. Looking closer, I see that the ending brace matches the brace after the ")]}'". [06:24:53] It's not valid JSON [06:24:55] So the document might not be truncated, just prefixed with crazy. [06:25:01] It's always appeared at the start of every gerrit response [06:25:54] * awight watches the Cheshire cat slowly fade [06:25:59] wat. [06:26:18] "To prevent against Cross Site Script Inclusion (XSSI) attacks, the JSON response body starts with a magic prefix line that must be stripped before feeding the rest of the response body to a JSON parser" [06:26:22] straight out of the gerrit docs [06:26:39] Thank you. [06:27:10] https://gerrit-review.googlesource.com/Documentation/rest-api.html#output [06:27:13] I was also thrown off by the changes not appearing in order, but that's fine. [06:31:08] It's probably questionable for them to be serving Content-Type: application/json while doing that. [06:31:57] * awight ltrims prefix and then evaluates the remainder as javascript in browser ;-) [06:34:40] ಠ_ಠ [06:39:06] definitely sketchy [07:00:02] Project mediawiki-core-code-coverage-docker build #4246: 04STILL FAILING in 4 hr 0 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-docker/4246/ [07:10:53] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK [07:15:21] (03CR) 10Jforrester: Delay npm install to selenium/qunit stages (031 comment) [integration/quibble] - 10https://gerrit.wikimedia.org/r/509466 (owner: 10Hashar) [07:16:25] there's something weird about the deployment calendar (the wiki page) today [07:16:53] while tues/wed/thur all how the right move in the table [07:17:00] they all have group2 to 1.34.0-wmf.5 in the text [07:18:00] Oh. I'll fix. [07:18:23] *all have [07:20:15] Krenair: PolyGerrit strips the prefix (even though that part is invalid json, the rest is valid) [07:27:22] apergos: Fixed. 
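The `)]}'` XSSI guard discussed above can be sketched in a few lines of Python. This is a minimal illustration of the prefix-stripping step from the Gerrit REST docs; the sample body below is made up, not a real Gerrit response.

```python
import json

# Gerrit REST responses begin with the magic line ")]}'" to defeat
# cross-site script inclusion (XSSI); it must be stripped before the
# body is handed to a JSON parser.
XSSI_PREFIX = ")]}'"

def parse_gerrit_json(body: str):
    """Strip Gerrit's XSSI prefix, if present, then parse the JSON."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)

# Stand-in response body (a real one would come from e.g.
# https://gerrit.wikimedia.org/r/changes/<change-id>/comments):
raw = ")]}'\n{\"/COMMIT_MSG\": []}"
print(parse_gerrit_json(raw))  # → {'/COMMIT_MSG': []}
```

Evaluating the remainder as JavaScript, as joked above, would of course defeat the whole point of the prefix.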
[07:27:25] 10Release-Engineering-Team, 10MediaWiki-Core-Testing, 10Patch-For-Review, 10Wikimedia-production-error (Shared Build Failure), 10phan: phan 1.2.6 is OOMing on MediaWiki core - https://phabricator.wikimedia.org/T219114 (10awight) I just experienced this crash 4 times in a row, when running CI on mw-ext-Fi... [07:27:46] ty! [07:28:29] (03PS1) 10Awight: Upgrade Phan image used for extension checks [integration/config] - 10https://gerrit.wikimedia.org/r/509758 (https://phabricator.wikimedia.org/T219114) [07:34:05] (03CR) 10Hashar: [C: 03+2] "Because the emoji chars were not escaped properly. Ie on this change the bot output:" [integration/config] - 10https://gerrit.wikimedia.org/r/508412 (owner: 10Hashar) [07:41:40] hey James_F, still around? [07:48:53] 10Continuous-Integration-Config, 10MediaWiki-extensions-NSFileRepo: NSFileRepo depends upon Lockdown extension - https://phabricator.wikimedia.org/T185610 (10Pwirth) the dependency to the extension Lockdown was removed here: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/NSFileRepo/+/456648/ [07:52:30] 10Continuous-Integration-Infrastructure, 10Wikidata, 10Wikidata-Campsite, 10User-zeljkofilipin: Run browser tests as part of "npm test" of wikidata/query/gui - https://phabricator.wikimedia.org/T222200 (10noarave) @zeljkofilipin I will be at the Hackathon and would gladly pair on this. I arrive on Fri 17.05. 
[08:22:45] (03CR) 10Hashar: [C: 03+2] 2.5.1-wmf9: drop faulty patch [integration/zuul] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/509417 (https://phabricator.wikimedia.org/T140297) (owner: 10Hashar) [08:22:57] (03CR) 10Hashar: [C: 03+2] "Already rebuild and published on apt.wikimedia.org" [integration/zuul] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/509417 (https://phabricator.wikimedia.org/T140297) (owner: 10Hashar) [08:25:52] (03Merged) 10jenkins-bot: 2.5.1-wmf9: drop faulty patch [integration/zuul] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/509417 (https://phabricator.wikimedia.org/T140297) (owner: 10Hashar) [08:31:54] Krenair: Back now. How can I help? [08:33:59] 10Continuous-Integration-Infrastructure, 10Jenkins: Jenkins jobs regularly being queued while resources appear to be readily available - https://phabricator.wikimedia.org/T218458 (10hashar) > ### Zuul status page > > * executing: 16 jobs > * **waiting: 41 jobs** (//queued//) > > {F28395644} This show the `g... [08:41:28] (03CR) 10Jforrester: "This won't help for the tagged task, I believe. That's waiting on https://gerrit.wikimedia.org/r/c/mediawiki/tools/phan/+/506064/ to be me" [integration/config] - 10https://gerrit.wikimedia.org/r/509758 (https://phabricator.wikimedia.org/T219114) (owner: 10Awight) [08:46:09] <_joe_> hi [08:46:20] <_joe_> https://phabricator.wikimedia.org/T222935 might suggest a quick revert I think [08:46:23] <_joe_> hashar: ^^ [08:46:34] <_joe_> (also, good morning :D) [08:49:42] hi _joe_ [08:50:04] RECOVERY - Mathoid on deployment-mathoid is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.024 second response time [08:50:07] <_joe_> eek all wikis are on .4 [08:50:26] <_joe_> so no, reverting to .3 is not an option [08:50:33] <_joe_> we need someone to look into this [08:51:08] it is not like have any clue about what mediawiki is doing nowadays :-\ [08:58:04] (03CR) 10Awight: "> This won't help for the tagged task, I believe. 
That's waiting on" [integration/config] - 10https://gerrit.wikimedia.org/r/509758 (https://phabricator.wikimedia.org/T219114) (owner: 10Awight) [08:58:07] _joe_: well I don't know what should be expected. But Raymond seems to point at a commit which might show that has been done intentionally [08:58:20] anyway, I will try to find folks familiar with commons page descriptions [08:58:38] <_joe_> thanks for handling this [09:02:06] PROBLEM - Mathoid on deployment-mathoid is CRITICAL: connect to address 172.16.5.73 and port 10042: Connection refused [09:21:34] 10Beta-Cluster-Infrastructure, 10Editing-team, 10Release Pipeline, 10serviceops, and 3 others: Migrate Beta cluster services to use Kubernetes - https://phabricator.wikimedia.org/T220235 (10Krinkle) >>! In T220235#5175206, @Joe wrote: >>>! In T220235#5174255, @Krinkle wrote: >> The status quo is that servi... [09:23:39] hashar: It was not intentional. What the team (tried to) change is removing the new "structured data" section from the inherited content. [09:23:52] It appears to instead have the adverse effect of removing the whole thing. [09:34:55] <_joe_> I need to create a new gerrit group for my SRE subteam; what's the procedure to do so? [09:39:28] Krinkle: No, it's a CPT bug it seems. [09:43:20] _joe_: are you an admin? [09:43:57] If not I can create one :) [09:45:05] oh, if you can just do it I will not finish this ticket [09:45:24] wmf-sre-serviceops please [09:45:30] Ok [09:45:33] * paladox does [09:45:48] with the members of the service ops team in it: and I guess able to administer it? [09:46:22] joe ako siaris, me, mu tante, f sero, effi e, I'm missing someone and embarrassed [09:46:40] <_joe_> mark [09:46:42] <_joe_> :D [09:46:49] pshaw ;-D [09:47:17] <_joe_> ?
[09:47:44] https://en.wiktionary.org/wiki/pshaw [09:47:50] here used for 'scoffery' :-P [09:48:00] <_joe_> hah :D [09:48:09] apergos: https://gerrit.wikimedia.org/r/#/admin/groups/1645,members [09:48:16] (It owns its self :)) [09:48:28] right we can just add the rest [09:48:29] ty [09:48:41] <_joe_> paladox: <3 [09:49:32] I think I’ve added everyone now [09:49:43] (I’ve removed my self so cannot edit it now :)) [09:49:47] I added them too [09:49:53] well they're really added then :-d [09:50:03] Heh [09:50:23] ty this is great [09:50:43] :) [10:09:05] Krinkle: thank you for the explanation :) Turns out it was an issue in mediawiki/core issue ( HttpRequestFactory->get( $url ) always returning null ) [10:19:39] (03PS4) 10Hashar: zuul: skip test/test-prio for CR+2 changes [integration/config] - 10https://gerrit.wikimedia.org/r/508512 (https://phabricator.wikimedia.org/T105474) [10:21:16] 10Phabricator-Sprint-Extension: Call to undefined method SprintProjectProfilePanelEngine::buildNavigation() when accessing Burndown since 2019.16 - https://phabricator.wikimedia.org/T222586 (10Aklapper) [10:22:00] _joe_: eventually the unbreak now from this morning has a fix and is going to be deployed ( was https://phabricator.wikimedia.org/T222935 ) thx for the ping! 
[10:22:04] RECOVERY - Mathoid on deployment-mathoid is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.023 second response time [10:22:13] <_joe_> great :) [10:22:37] (03CR) 10Hashar: [C: 03+2] zuul: skip test/test-prio for CR+2 changes (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/508512 (https://phabricator.wikimedia.org/T105474) (owner: 10Hashar) [10:24:06] (03Merged) 10jenkins-bot: zuul: skip test/test-prio for CR+2 changes [integration/config] - 10https://gerrit.wikimedia.org/r/508512 (https://phabricator.wikimedia.org/T105474) (owner: 10Hashar) [10:26:56] 10Continuous-Integration-Infrastructure, 10Wikidata, 10Wikidata-Campsite, 10Wikimedia-Hackathon-2019, 10User-zeljkofilipin: Run browser tests as part of "npm test" of wikidata/query/gui - https://phabricator.wikimedia.org/T222200 (10zeljkofilipin) [10:28:05] PROBLEM - Mathoid on deployment-mathoid is CRITICAL: connect to address 172.16.5.73 and port 10042: Connection refused [10:37:26] !log Rolling CI config change https://gerrit.wikimedia.org/r/508512 which caused some patches to not be processed last week # https://wikitech.wikimedia.org/wiki/Incident_documentation/20190506-zuul / T105474 [10:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [10:37:30] T105474: 'recheck' on a CR+2 patch should trigger gate-and-submit, not test - https://phabricator.wikimedia.org/T105474 [10:37:31] this time with a bug fix :) [10:37:38] and a test covering the misbehavior [10:46:00] (03CR) 10Awight: "AIUI, this patch won't upgrade to Phan 1.3.2, but instead it will benefit from the workaround in Id49f83699869c6e0ef044b611da3fb0e4b934be7" [integration/config] - 10https://gerrit.wikimedia.org/r/509758 (https://phabricator.wikimedia.org/T219114) (owner: 10Awight) [10:47:02] 10Continuous-Integration-Config, 10Release-Engineering-Team (Kanban), 10Zuul, 10Patch-For-Review, 10Upstream: 'recheck' on a CR+2 patch should trigger gate-and-submit, not test - 
https://phabricator.wikimedia.org/T105474 (10hashar) 05Open→03Resolved So now it should be fixed. The second iteration of... [10:47:03] hashar: Want to C+2 https://gerrit.wikimedia.org/r/c/mediawiki/tools/phan/+/509769 for me so we can theoretically fix the OOM issue? [10:47:37] ^ <3 James_F [10:47:57] Sadly, at least 110% of my job is to theoretically fix things. [10:48:18] James_F: I think we also have some CI container to bump [10:48:26] <_joe_> paladox: can I abuse your kindness and ask you how do I create a CR for a new dashboard like https://gerrit.wikimedia.org/r/#/c/wikimedia/+/503038/ [10:48:39] <_joe_> I can see https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia,dashboards [10:49:35] <_joe_> so it seems I should create a review for a branch like refs/meta/dashboards/sre-serviceops [10:49:42] hashar: The container now uses whatever the individual repo so we'll need to bump in each repo. [10:49:45] James_F: the workaround/right thing for ci is to disable phan progress bar when running on CI [10:49:48] <_joe_> but when I try to push a review it doesn't work [10:50:00] hashar: That is already merged upstream. [10:50:02] <_joe_> ! [remote rejected] HEAD -> refs/for/refs/meta/dashboards/sre-serviceops (refs/meta/dashboards/sre-serviceops not found) [10:50:02] _joe_: how are you pushing? [10:50:13] Will brb in a bit (in lesson) [10:50:18] <_joe_> sorry :D [10:50:33] <_joe_> whenever you have time, else I'll figure it out [10:50:49] Oh it’s fine :) [10:51:25] <_joe_> I guess I need someone to create the branch [10:52:39] James_F: yeah I guess we can just do it [10:52:41] 10RelEng-Archive-FY201718-Q1, 10Scap, 10Patch-For-Review: Scap: keyholder Too many authentication failures - https://phabricator.wikimedia.org/T172333 (10akosiaris) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/377269/ had fallen through the cracks. It's now merged, right before a SWAT window, in or... [10:52:44] _joe_: refs/meta/dashboard [10:52:56] hashar: We can do both. 
[10:52:56] I mean the version bump for mediawiki/tools/phan . Then there is no need to update the repositories [10:52:58] That should be the branch :) [10:53:26] James_F: well we will have to bump the version eventually but there is no hurry as long as we get the CI container to disable the progress bar we should be fine [10:54:35] James_F: also Kunal does signed tag for mediawiki/tools/phan . I have no clue whether packagist enforces those gpg signatures or not [10:56:00] hashar: speaking of obvious things Phan should have caught - https://phabricator.wikimedia.org/T223085 [10:56:29] hashar: I do too. Gerrit won't show you but that commit is signed. [10:57:31] James_F: ;]] [10:57:56] and I am gonna deploy awight bump of releng/mediawiki-phan CI image [10:58:52] Krinkle: maybe a newer version of phan would have caught it. I must say I have not been any active on phan front ;/ [10:58:59] <_joe_> paladox: uhm not according to the docs, but I guess someone needs to create that branch. Thanks anyways :)) [10:59:14] hashar: it's the most basic thing, a variable that is never defined. Phan 1.x caught this 2 years ago [10:59:29] I'm pretty sure that either Phan regressed, or it was wrongly disabled, or it's catching it but Jenkins not enforcing it [10:59:38] :-\ [11:01:13] Phan isn't a magic bullet ;) [11:08:52] 10Release-Engineering-Team, 10MediaWiki-SWAT-deployments: SWAT Shepherds seem wrong - https://phabricator.wikimedia.org/T223087 (10Reedy) [11:10:52] (03CR) 10Hashar: [C: 03+2] "Thank you! 
I have updated the jobs mwext-php70-phan-docker mwskin-php70-phan-docker php70-phan-docker" [integration/config] - 10https://gerrit.wikimedia.org/r/509758 (https://phabricator.wikimedia.org/T219114) (owner: 10Awight) [11:11:34] 10Release-Engineering-Team, 10MediaWiki-Core-Testing, 10Patch-For-Review, 10Wikimedia-production-error (Shared Build Failure), 10phan: phan 1.2.6 is OOMing on MediaWiki core - https://phabricator.wikimedia.org/T219114 (10hashar) I have updated the jobs mwext-php70-phan-docker mwskin-php70-phan-docker php... [11:13:33] (03Merged) 10jenkins-bot: Upgrade Phan image used for extension checks [integration/config] - 10https://gerrit.wikimedia.org/r/509758 (https://phabricator.wikimedia.org/T219114) (owner: 10Awight) [11:18:17] 10Release-Engineering-Team, 10MediaWiki-Core-Testing, 10Patch-For-Review, 10Wikimedia-production-error (Shared Build Failure), 10phan: phan 1.2.6 is OOMing on MediaWiki core - https://phabricator.wikimedia.org/T219114 (10hashar) 05Open→03Resolved a:03hashar releng/mediawiki-phan:0.14.0 no more enfo... [11:32:48] _joe_: you should be able to create the branch [11:32:53] Which repo is this for? :) [11:33:02] (Also back for ~40mins) [11:33:23] he's already pushed it [11:35:17] <_joe_> paladox: done, thanks [11:35:36] Ah ok :) [11:38:04] RECOVERY - Mathoid on deployment-mathoid is OK: HTTP OK: HTTP/1.1 200 OK - 925 bytes in 0.026 second response time [11:43:41] Thanks hashar, I'll post to the Phan OOM task if the issue was not resolved by the workaround! [11:44:03] PROBLEM - Mathoid on deployment-mathoid is CRITICAL: connect to address 172.16.5.73 and port 10042: Connection refused [11:51:10] 10Continuous-Integration-Infrastructure, 10Zuul, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Upload zuul_2.5.1-wmf9 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10hashar) Upgraded and it seems to work fine. Thank you! 
[12:06:37] 10Beta-Cluster-Infrastructure, 10Editing-team, 10Release Pipeline, 10serviceops, and 3 others: Migrate Beta cluster services to use Kubernetes - https://phabricator.wikimedia.org/T220235 (10hashar) On beta, we had Parsoid and some other services deployed via Jenkins whenever a change got merged. Then each... [12:28:20] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10puppet-compiler: Puppet catalog compiler - increasing max concurrent jobs - https://phabricator.wikimedia.org/T221969 (10hashar) >>! In T221969#5167384, @herron wrote: > @hashar while on the topic, is it possible for Jenkins to more evenl... [12:34:23] 10Gerrit, 10Repository-Admins, 10Shape Expressions, 10Wikidata, and 2 others: rename repository for WikibaseSchema - https://phabricator.wikimedia.org/T221946 (10Michael) 05Open→03Resolved [12:34:33] 10Gerrit, 10Repository-Admins, 10Shape Expressions, 10Wikidata, and 2 others: rename repository for WikibaseSchema - https://phabricator.wikimedia.org/T221946 (10Michael) [12:38:47] The extensions phan job is broken after my patch. e.g. https://integration.wikimedia.org/ci/job/mwext-php70-phan-docker/28628/consoleFull [12:48:00] (03PS1) 10Awight: Revert "Upgrade Phan image used for extension checks" [integration/config] - 10https://gerrit.wikimedia.org/r/509814 (https://phabricator.wikimedia.org/T219114) [12:48:18] hashar: ^ Sorry, this didn't work. [12:48:35] awight: what is broken? [12:49:45] hashar: Two things--the fix turns out to only be applied to the run-core.sh script so it wouldn't help anyway, and * for some unknown reason, phan is now rejecting the "-m" flag although it is listed in the usage text. [12:49:59] Here's an example failure, https://integration.wikimedia.org/ci/job/mwext-php70-phan-docker/28627/consoleFull [12:50:18] lol [12:50:21] ;D [12:50:53] Harrr. I could have run this locally and discovered before deployment, apologies. [12:51:48] awight: I guess it doesn’t allow options after files? 
[12:52:00] so -m checkstyle would need to be before /mediawiki/extensions/Popups [12:52:29] hm, but then you’d expect the error to say “file not found: -m” instead of “unknown option” [12:52:50] could it be a change in phan 1.2.6 [12:52:53] ? [12:54:20] yes [12:54:21] cause https://integration.wikimedia.org/ci/job/mwext-php70-phan-docker/28634/consoleFull [12:54:23] pass fine [12:54:39] but it uses phan/phan @ 0.8 [12:54:43] but it uses phan/phan @ 0.8.0 [12:55:08] (03CR) 10Hashar: [C: 03+2] Revert "Upgrade Phan image used for extension checks" [integration/config] - 10https://gerrit.wikimedia.org/r/509814 (https://phabricator.wikimedia.org/T219114) (owner: 10Awight) [12:58:03] (03Merged) 10jenkins-bot: Revert "Upgrade Phan image used for extension checks" [integration/config] - 10https://gerrit.wikimedia.org/r/509814 (https://phabricator.wikimedia.org/T219114) (owner: 10Awight) [12:58:11] tragic... [12:58:12] Lucas_WMDE: you might well be right [12:58:25] 10Release-Engineering-Team (Kanban), 10MediaWiki-Core-Testing, 10Patch-For-Review, 10Wikimedia-production-error (Shared Build Failure), 10phan: phan 1.2.6 is OOMing on MediaWiki core - https://phabricator.wikimedia.org/T219114 (10hashar) 05Resolved→03Open @awight noticed the container is broken somet... [13:08:47] 10Release-Engineering-Team (Kanban), 10MediaWiki-Core-Testing, 10Patch-For-Review, 10Wikimedia-production-error (Shared Build Failure), 10phan: phan 1.2.6 is OOMing on MediaWiki core - https://phabricator.wikimedia.org/T219114 (10hashar) The working one does: ` docker run releng/mediawiki-phan:0.1.11 /me... [13:11:28] 10Continuous-Integration-Infrastructure: CI: upgrade tox or allow to override its version per-project - https://phabricator.wikimedia.org/T222512 (10hashar) With the current CI we do not have a good way to use a different tox version per repository, the upgrade is for all repositories :-/ Though lot of tox jobs... 
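Lucas's guess about option ordering can be demonstrated with Python's POSIX-style `getopt`, which stops option parsing at the first non-option argument. This is only a generic illustration of that parsing behaviour under that assumption, not an inspection of phan's actual argument parser (the log suggests the real culprit may be a change in phan 1.2.6).

```python
import getopt

# POSIX-style parsers stop at the first positional argument, so any
# flags placed after it are left unparsed in the remainder.
opts, rest = getopt.getopt(['-m', 'checkstyle', 'extensions/Popups'], 'm:')
print(opts, rest)   # → [('-m', 'checkstyle')] ['extensions/Popups']

opts, rest = getopt.getopt(['extensions/Popups', '-m', 'checkstyle'], 'm:')
print(opts, rest)   # → [] ['extensions/Popups', '-m', 'checkstyle']
```

Under such a parser, `phan /mediawiki/extensions/Popups -m checkstyle` would see `-m` only as a trailing argument, which could surface as an "unknown option" style error once the remainder is validated.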
[13:13:56] (03PS1) 10Hashar: docker: bump tox from 2.9.1 to 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509824 (https://phabricator.wikimedia.org/T222512) [13:23:52] (03CR) 10Hashar: [C: 03+2] docker: bump tox from 2.9.1 to 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509824 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:24:08] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI: upgrade tox or allow to override its version per-project - https://phabricator.wikimedia.org/T222512 (10hashar) a:03hashar [13:25:25] (03Merged) 10jenkins-bot: docker: bump tox from 2.9.1 to 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509824 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:27:56] 10Beta-Cluster-Infrastructure, 10Editing-team, 10Release Pipeline, 10serviceops, and 3 others: Migrate Beta cluster services to use Kubernetes - https://phabricator.wikimedia.org/T220235 (10Ottomata) Could we use image version: latest in beta hiera? And somehow pull down the new latest and restart the ima... [13:29:08] (03PS1) 10Hashar: Allow image override in tox publish templates [integration/config] - 10https://gerrit.wikimedia.org/r/509839 (https://phabricator.wikimedia.org/T222512) [13:29:12] (03PS1) 10Hashar: Switch cumin to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509840 (https://phabricator.wikimedia.org/T222512) [13:29:14] (03PS1) 10Hashar: jjb: suffix some job-group with '-jobs' [integration/config] - 10https://gerrit.wikimedia.org/r/509841 [13:33:19] (03CR) 10Hashar: [C: 03+2] "noop in JJB as expected." 
[integration/config] - 10https://gerrit.wikimedia.org/r/509839 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:34:21] (03CR) 10Hashar: [C: 03+2] "Updated:" [integration/config] - 10https://gerrit.wikimedia.org/r/509840 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:34:41] (03CR) 10Hashar: [C: 03+2] "noop in jjb" [integration/config] - 10https://gerrit.wikimedia.org/r/509841 (owner: 10Hashar) [13:35:20] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI: upgrade tox or allow to override its version per-project - https://phabricator.wikimedia.org/T222512 (10hashar) That is done for `operations/software/cumin`. [13:35:53] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI: upgrade tox or allow to override its version per-project - https://phabricator.wikimedia.org/T222512 (10hashar) p:05Triage→03Normal [13:37:11] (03Merged) 10jenkins-bot: Allow image override in tox publish templates [integration/config] - 10https://gerrit.wikimedia.org/r/509839 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:37:15] (03Merged) 10jenkins-bot: Switch cumin to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509840 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:40:23] (03Merged) 10jenkins-bot: jjb: suffix some job-group with '-jobs' [integration/config] - 10https://gerrit.wikimedia.org/r/509841 (owner: 10Hashar) [13:45:29] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI: upgrade tox or allow to override its version per-project - https://phabricator.wikimedia.org/T222512 (10Volans) @hashar thanks for the patch, I'm getting this error though: ` 13:43:33 Unable to find image 'd... 
[13:46:36] 10Continuous-Integration-Infrastructure, 10Zuul: Zuul: Highlight relevant change on Zuul status page when following submit pipeline url - https://phabricator.wikimedia.org/T65399 (10Jdforrester-WMF) It'd be nice to get this back; can we convince upstream to take it? [13:47:19] (03PS1) 10Hashar: Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) [13:48:04] (03PS2) 10Hashar: Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) [13:48:15] (03CR) 10Hashar: [C: 03+2] Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:48:27] (03CR) 10jerkins-bot: [V: 04-1] Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:48:39] (03CR) 10jerkins-bot: [V: 04-1] Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:49:39] !log Building docker image releng/tox:0.4.0 [13:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:56:48] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10puppet-compiler: Puppet catalog compiler - increasing max concurrent jobs - https://phabricator.wikimedia.org/T221969 (10herron) >>! In T221969#5176406, @hashar wrote: > I have deployed it on May 6th and thus puppet compile jobs should be... 
[13:59:29] (03CR) 10Hashar: [C: 03+2] "recheck" [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:59:48] (03CR) 10jerkins-bot: [V: 04-1] Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [13:59:50] (03CR) 10jerkins-bot: [V: 04-1] Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [14:01:42] (03CR) 10Hashar: [C: 03+2] "recheck" [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [14:02:04] (03CR) 10jerkins-bot: [V: 04-1] Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [14:02:07] (03CR) 10jerkins-bot: [V: 04-1] Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [14:15:15] (03CR) 10Hashar: [C: 03+2] "Somehow the image docker-registry.wikimedia.org/releng/tox:0.4.0 can not be found :(" [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [14:15:55] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI: upgrade tox or allow to override its version per-project - https://phabricator.wikimedia.org/T222512 (10hashar) Eventually the new image `docker-registry.wikimedia.org/releng/tox:0.4.0` can not be found on t... 
[14:17:52] (03PS1) 10Hashar: Revert "Switch cumin to tox 3.10.0" [integration/config] - 10https://gerrit.wikimedia.org/r/509856 [14:18:18] (03CR) 10Hashar: [C: 03+2] "The releng/tox:0.4.0 image can not be found :-(" [integration/config] - 10https://gerrit.wikimedia.org/r/509856 (owner: 10Hashar) [14:18:30] (03PS12) 10Kosta Harlan: Generate junit.xml for sonar-scanner's usage [integration/config] - 10https://gerrit.wikimedia.org/r/508019 (https://phabricator.wikimedia.org/T208522) [14:19:07] (03Merged) 10jenkins-bot: Bump most jobs to tox 3.10.0 [integration/config] - 10https://gerrit.wikimedia.org/r/509849 (https://phabricator.wikimedia.org/T222512) (owner: 10Hashar) [14:21:20] (03Merged) 10jenkins-bot: Revert "Switch cumin to tox 3.10.0" [integration/config] - 10https://gerrit.wikimedia.org/r/509856 (owner: 10Hashar) [14:23:27] (03PS1) 10Hashar: Revert "Bump most jobs to tox 3.10.0" [integration/config] - 10https://gerrit.wikimedia.org/r/509857 [14:23:34] (03CR) 10Hashar: [C: 03+2] Revert "Bump most jobs to tox 3.10.0" [integration/config] - 10https://gerrit.wikimedia.org/r/509857 (owner: 10Hashar) [14:25:49] (03Merged) 10jenkins-bot: Revert "Bump most jobs to tox 3.10.0" [integration/config] - 10https://gerrit.wikimedia.org/r/509857 (owner: 10Hashar) [14:28:41] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI: upgrade tox or allow to override its version per-project - https://phabricator.wikimedia.org/T222512 (10hashar) Fabian explained it is a replication issue in the docker registry. So just have to wait (tm).... 
10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Patch-For-Review: CI: upgrade tox or allow to override its version per-project - https://phabricator.wikimedia.org/T222512 (10hashar) 05Open→03Stalled Pending availability of the image T222210#5176863 [14:44:28] (03CR) 10Hashar: "My only concern is to have the Docker image to match Quibble version which is currently 0.0.31. So the container should be 0.0.31-3 not 0" (032 comments) [integration/config] - 10https://gerrit.wikimedia.org/r/508019 (https://phabricator.wikimedia.org/T208522) (owner: 10Kosta Harlan) [14:48:33] !log if you build Docker containers, there is a long delay between it being built/published and it actually being available https://phabricator.wikimedia.org/T222210#5176863 known issue [14:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:54:11] 10Phabricator, 10Developer-Advocacy (Apr-Jun 2019): Re-evaluate our use of Phabricator Conpherence chat - https://phabricator.wikimedia.org/T127640 (10chasemp) Security team has an ongoing chat with the stewards via conpherence. So far we (Security) have proposed a move to IRC but there are various blockers....
[14:55:31] (PS13) Kosta Harlan: Generate junit.xml for sonar-scanner's usage [integration/config] - https://gerrit.wikimedia.org/r/508019 (https://phabricator.wikimedia.org/T208522)
[14:56:40] (CR) Kosta Harlan: Generate junit.xml for sonar-scanner's usage (2 comments) [integration/config] - https://gerrit.wikimedia.org/r/508019 (https://phabricator.wikimedia.org/T208522) (owner: Kosta Harlan)
[14:57:35] (PS31) Kosta Harlan: Establish codehealth pipeline, enable for GrowthExperiments only [integration/config] - https://gerrit.wikimedia.org/r/502606 (https://phabricator.wikimedia.org/T218598)
[15:02:55] (CR) Thcipriani: [C: +2] Generate junit.xml for sonar-scanner's usage [integration/config] - https://gerrit.wikimedia.org/r/508019 (https://phabricator.wikimedia.org/T208522) (owner: Kosta Harlan)
[15:03:54] (PS32) Kosta Harlan: Establish codehealth pipeline, enable for GrowthExperiments only [integration/config] - https://gerrit.wikimedia.org/r/502606 (https://phabricator.wikimedia.org/T218598)
[15:04:29] (Merged) jenkins-bot: Generate junit.xml for sonar-scanner's usage [integration/config] - https://gerrit.wikimedia.org/r/508019 (https://phabricator.wikimedia.org/T208522) (owner: Kosta Harlan)
[15:06:40] !log updating docker-pkg images on contint1001 for https://gerrit.wikimedia.org/r/508019
[15:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:14:29] Gerrit, Release-Engineering-Team (Backlog), Wikimedia-Logstash, observability, and 2 others: Look into shoving gerrit logs into logstash - https://phabricator.wikimedia.org/T141324 (Dzahn)
[16:03:08] hashar, thcipriani and I see a different error for "docker pull docker-registry.wikimedia.org/releng/quibble-stretch-php70:0.0.31-3" than what you noted in T222210#5176863
[16:03:13] T222210: placeholder task for migration problems - https://phabricator.wikimedia.org/T222210
[16:03:22] "Error response from daemon: manifest for docker-registry.wikimedia.org/releng/quibble-stretch-php70:0.0.31-3 not found: manifest unknown: manifest unknown"
[16:03:51] do you think it's the same underlying issue, of slow cross-DC replication?
[16:11:10] hrm, I get "manifest for xxx not found" vs unknown
[16:11:19] probably worth adding to the ticket
[16:11:32] ah. OK I'll add it. This is me attempting to pull locally
[16:16:34] kostajh: thcipriani the underlying problem is the same
[16:16:48] there is an open CR to fix that; once it's merged it will be fixed
[16:17:23] also keep in mind that it would eventually get fixed anyway once replication catches up, however that could take some time
[16:23:13] kostajh: yeah that is the same issue :(
[16:36:22] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.34.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T220729 (Jdforrester-WMF) Open→Resolved Train was fully completed on Thursday.
[16:43:08] hashar: kostajh thcipriani the CR has been merged, it should be getting fixed soon
[16:43:16] cool, thanks fsero!
[16:48:54] Continuous-Integration-Infrastructure: Reorganize CI phabricator projects - https://phabricator.wikimedia.org/T223134 (hashar)
[16:52:40] progress, I now get the "Unknown blob" error instead of the manifest unknown
[16:53:34] kostajh: what I suspect is that the metadata are replicated first but the layers/tarballs are not yet
[16:53:48] so eventually after some time newimage:0.0.42 is seen, as well as the list of all its layers
[16:53:55] but some of the listed layers are not replicated yet
[16:53:59] and thus unknown blob
[16:54:03] fsero: thanks :]
[16:55:04] fsero: I am not sure why I haven't noticed the issue earlier though. But maybe we haven't generated new images on thursday/friday so ..
[16:55:07] Yippee, build fixed!
[16:55:07] Project mediawiki-core-code-coverage-docker build #4247: FIXED in 1 hr 53 min: https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-docker/4247/
[16:55:28] hashar: depends on how big the layers are when you upload a new one
[16:55:42] that I have no idea
[16:55:44] small layers are replicated quickly, so new tags usually get replicated easily
[16:55:52] new images take some time
[16:56:18] anyway it's going to be fixed as soon as the caching layer gets the update
[16:56:20] + docker itself probably has its own cache of requests to the registry somehow
[16:56:29] err no
[16:56:34] it caches downloaded images
[16:56:43] but always checks against the registry
[16:56:50] ah
[16:57:22] right now I have some instances able to download an image while others have unknown blob. Maybe that is a DNS cache thing
[16:57:34] at least some instances managed to pull the image
[16:57:35] !
[16:58:22] Release-Engineering-Team, Operations, SRE-Access-Requests: Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (Jdforrester-WMF)
[17:03:05] Release-Engineering-Team (Watching / External), Operations, SRE-Access-Requests, Patch-For-Review: Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (greg) Approved on my side, specific thing right now is the node6-node10 migration of CI...
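An editor's aside on the pull failures above: in the registry's v2 protocol a tag resolves to a manifest, and the manifest names the config blob and every layer blob by digest; all of those blobs must already be present before `docker pull` can finish. If the manifest replicates before its layers, the pull fails with "unknown blob", exactly as described. A minimal sketch of that dependency (the manifest shape follows the standard schema2 layout; the image digests here are fabricated placeholders):

```python
# Sketch: given a (schema2-style) image manifest, list every blob digest
# that must already be replicated before a pull of that tag can succeed.

def required_blobs(manifest):
    """Return the config digest plus all layer digests from a manifest."""
    digests = [manifest["config"]["digest"]]
    digests.extend(layer["digest"] for layer in manifest["layers"])
    return digests

# Example manifest (digests are made-up placeholders):
manifest = {
    "schemaVersion": 2,
    "config": {"digest": "sha256:aaa"},
    "layers": [
        {"digest": "sha256:bbb"},
        {"digest": "sha256:ccc"},
    ],
}

print(required_blobs(manifest))  # ['sha256:aaa', 'sha256:bbb', 'sha256:ccc']
```

Replication that copies manifests before all of their listed blobs leaves a window in which every digest in this list must be checked before the tag is actually pullable.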
[17:03:24] Release-Engineering-Team (Watching / External), Operations, SRE-Access-Requests, Patch-For-Review: Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (hashar)
[17:03:53] Release-Engineering-Team (Watching / External), Operations, SRE-Access-Requests, Patch-For-Review: Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (greg)
[17:04:24] Release-Engineering-Team (Watching / External), Operations, SRE-Access-Requests, Patch-For-Review: Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (hashar) +1 we also need you to be added to the LDAP group `ciadmin` which grants write...
[17:05:03] Release-Engineering-Team (Watching / External), Operations, SRE-Access-Requests, Patch-For-Review: Grant James Forrester access to contint-admins and contint-docker, and to the ciadmin LDAP group - https://phabricator.wikimedia.org/T223137 (Jdforrester-WMF)
[17:26:18] hey all, i just deployed a config change
[17:26:26] that seems to have deployed correctly
[17:26:52] but doesn't seem to have taken effect for some requests or servers.
[17:27:18] is it possible there are long-running processes that serve multiple http requests and have old config?
[17:27:42] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&orgId=1&from=1557767558858&to=1557768458858&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All
[17:28:07] you can see ^ that events are mostly turned off, but some smaller volume remains
[17:28:13] it should drop to 0
[17:31:27] Continuous-Integration-Infrastructure: Reorganize CI phabricator projects - https://phabricator.wikimedia.org/T223134 (Jdforrester-WMF)
[17:31:46] Continuous-Integration-Infrastructure: Reorganize CI phabricator projects - https://phabricator.wikimedia.org/T223134 (Jdforrester-WMF)
[17:32:55] Krinkle: yt? looking for some help on ^^^^ not sure if you'd be the best to ping tho
[17:33:00] Continuous-Integration-Infrastructure: Reorganize CI phabricator projects - https://phabricator.wikimedia.org/T223134 (Jdforrester-WMF)
[17:34:49] thcipriani: yt?
[17:35:35] zomg
[17:35:45] ottomata: Specifically prod or beta?
[17:35:48] prod
[17:36:06] we are working on T222962
[17:36:12] T222962: Use new eventgate chart release analytics for eventgate-analytics service. - https://phabricator.wikimedia.org/T222962
[17:36:16] hrm, are there specific servers that this is happening on? or sporadically on a bunch of servers?
[17:36:18] we wanted to stop events, replace the service, then turn on events.
[17:36:26] i can easily tell from the cirrussearch-requests
[17:36:29] since they log the mediawiki host
[17:36:55] kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eqiad.mediawiki.cirrussearch-request -o end -c 100 | jq .mediawiki_host | sort | uniq -c | sort -nr
[17:37:02] https://www.irccloud.com/pastebin/UUdDBIuN/
[17:37:14] What are the requests?
[17:37:30] those are search logs
[17:37:34] the api request logs are there too
[17:37:37] higher volume though
[17:37:46] in total there are still ~400 msgs / second flowing through
[17:37:50] but it should be 0.
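The kafkacat one-liner above tallies the last 100 events by `mediawiki_host` with jq/sort/uniq. The same counting logic as a small Python sketch, with made-up sample events standing in for the Kafka output (the field name `mediawiki_host` is taken from the command itself):

```python
# Count JSON events per mediawiki_host, most frequent first -- the Python
# equivalent of the `jq .mediawiki_host | sort | uniq -c | sort -nr` half
# of the kafkacat pipeline quoted above.
import json
from collections import Counter

def count_by_host(lines):
    counts = Counter(json.loads(line)["mediawiki_host"] for line in lines)
    return counts.most_common()

# Fabricated sample events; real ones would come from kafkacat's stdout.
events = [
    '{"mediawiki_host": "mw1289.eqiad.wmnet"}',
    '{"mediawiki_host": "mw1289.eqiad.wmnet"}',
    '{"mediawiki_host": "mw1276.eqiad.wmnet"}',
]
print(count_by_host(events))
```

Any host still appearing in this tally after the channels were disabled is a host that has not picked up the new config.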
[17:38:07] these are 2 (now disabled) monolog channels
[17:38:19] this is the config change i deployed
[17:38:20] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/509866/2/wmf-config/InitialiseSettings.php
[17:38:49] touch wmf-config/InitialiseSettings.php on deploy1001 and sync it again?
[17:39:00] i can see that e.g. on mw1289
[17:39:02] the change went out fine
[17:39:20] it is in /srv/mediawiki/wmf-config/InitialiseSettings.php there
[17:39:30] i think the scap worked fine Reedy...right?
[17:39:40] From the drop off, yeah
[17:39:46] Could be for some reason it didn't trigger a recache/recompute of IS on all hosts
[17:40:07] oh that is cached...by hhvm?
[17:40:18] yeah, was just about to suggest a touch and resync
[17:40:23] k will try
[17:40:38] It's a bit of a crappy fix, but it *usually* helps for these weird straggler type things
[17:41:19] -rw------- 1 www-data www-data 63642 May 13 17:11 conf-zhwikisource-hhvm
[17:41:25] It looks like on disk it did
[17:42:06] But maybe not for all... But that could just be that the host hasn't received a request for that wiki yet
[17:42:26] huh hm ok this may have worked...
[17:42:38] now don't see any in kafka
[17:42:53] there are some race conditions that are possible https://phabricator.wikimedia.org/T217830#5009234
[17:43:43] ottomata: FWIW, touch and re-sync has been the solution for many things as long as I've been deploying... And that's a good 8 years at this point :)
[17:43:48] huh.
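The "touch and resync" workaround being suggested amounts to two commands on the deploy host: bump the config file's mtime so the stat-based cache sees it as changed, then sync it out again. A sketch, expressed in Python for illustration; the staging path and the `scap sync-file` invocation are taken from common practice in this conversation's environment, not verified against scap's current CLI:

```python
# Sketch of the "touch + resync" workaround: update the file's mtime so
# the HHVM config cache treats it as changed, then sync it out again.
# The staging path and scap subcommand are assumptions for illustration.

CONFIG = "/srv/mediawiki-staging/wmf-config/InitialiseSettings.php"

def resync_commands(path, message):
    """Return the two shell commands a deployer would run, in order."""
    return [
        ["touch", path],                       # bump mtime only, no edit
        ["scap", "sync-file", path, message],  # push it to the fleet again
    ]

for cmd in resync_commands(CONFIG, "re-sync to clear stale config cache"):
    print(" ".join(cmd))
```

Note the touch is the important half: without a changed mtime, a second sync of identical content may not invalidate anything on the targets.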
[17:43:55] hah ok good to know
[17:44:04] definitely dropping further on grafana
[17:44:10] might add that to https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Change_wiki_configuration
[17:44:32] Could be worth noting
[17:45:16] Dropped from nearly 400 to about 100 req/s
[17:45:22] Deployments, Release-Engineering-Team (Backlog), HHVM, Wikimedia-Incident: Figure out why HHVM kept running stale code for hours - https://phabricator.wikimedia.org/T181833 (thcipriani) I think I have a reasonable explanation for this over in: T217830#5009234 The tl;dr is that there's a possible...
[17:45:27] 50
[17:45:34] Definitely seems to have helped
[17:45:54] I guess it's one of those things where, if it broke like this more often, we might've fixed it
[17:46:11] Whereas it's mostly an annoyance, and easily enough worked around
[17:46:18] heh, 5 req/s
[17:46:18] ottomata: still need help? what's the context.
[17:46:32] Krinkle: caching sucks
[17:46:37] Yes, there are long-running processes and requests. Processes of 8 hours and multiple days are not unusual.
[17:46:40] TLDR is touch IS.php and sync fixes it
[17:46:44] video transcodes and maintenance scripts.
[17:46:50] But fleet restarts suck more.
[17:48:27] 0.005% of the requests before according to grafana
[17:48:31] That's practically background noise
[17:48:48] Release-Engineering-Team: Echo Selenium tests - Build failure - Cannot find module wdio-mediawiki/Util - https://phabricator.wikimedia.org/T223150 (Etonkovidova)
[17:50:42] Yeah, we've come to accept a certain level of fatals in production, sadly.
[17:51:36] Not fatals
[17:51:58] Release-Engineering-Team: Echo Selenium tests - Build failure - Cannot find module wdio-mediawiki/Util - https://phabricator.wikimedia.org/T223150 (Jdforrester-WMF) Possibly an odd `npm install` failure?
[17:55:26] Release-Engineering-Team: Echo Selenium tests - Build failure - Cannot find module wdio-mediawiki/Util - https://phabricator.wikimedia.org/T223150 (Jdforrester-WMF) Everything seems fine in the npm logs, for the currently-accepted values of 'fine'.
[17:56:58] Hi, I'm getting https://ctrlv.cz/UGQo when https://gerrit.wikimedia.org/r/c/wikimedia-cz/tracker/+/509904 is loaded. Is that to be expected, or a bug?
[17:57:52] Reedy: ya some of those are test events from the service checker stuff
[17:57:56] so we are good.
[17:57:57] thank you so much!
[17:58:02] Krinkle: Reedy figured it out, thank you!
[18:00:43] Release-Engineering-Team, Operations, Release Pipeline, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (Pablo-WMDE) Hi @akosiaris - thanks for getting back to us. > sending a Host: HTTP for the identification of the exact project....
[18:02:18] Release-Engineering-Team: Echo Selenium tests - Build failure - Cannot find module wdio-mediawiki/Util - https://phabricator.wikimedia.org/T223150 (Krinkle) The Echo extension currently specifies in its `package.json` file that it depends on `wdio-mediawiki` version 0.1.7. The `Util` class was added in vers...
[18:05:43] Release-Engineering-Team (Kanban), Growth-Team, Notifications: Echo Selenium tests - Build failure - Cannot find module wdio-mediawiki/Util - https://phabricator.wikimedia.org/T223150 (Jdforrester-WMF) Ha. Thanks, Timo.
[18:06:15] Release-Engineering-Team (Watching / External), Growth-Team, Notifications: Echo Selenium tests - Build failure - Cannot find module wdio-mediawiki/Util - https://phabricator.wikimedia.org/T223150 (Jdforrester-WMF)
[18:07:07] James_F, hey, sorry I'd had to go before you got back to IRC - was just going to ask you what you thought of prioritising that commons local description problem, but I see it's been/being taken care of
[18:09:48] Krenair: Ah, cool. :-)
[18:18:51] Reedy: should I always do the touch + sync again for config changes just in case?
[18:19:19] Shouldn't be any need to do it generally, but you can
[18:19:45] am worried that some servers won't pick up the change...and I won't notice
[18:20:06] Well, the graphs should go back up to a similar amount
[18:20:18] true, but if e.g. 1 or 2 servers don't send requests
[18:20:22] we might not notice that in the numbers
[18:20:35] It'll catch up eventually, from other deploys etc
[18:20:40] hm ok. true.
[18:20:45] alright, i'll let it be if it looks about right.
[18:20:47] thanks
[18:21:50] It's heading in the right direction
[18:23:03] looks to be about 2/3rds of what it was before
[18:25:48] ottomata: Amusingly, it seems to be about the same amount down from before you turned it off.. as was still sending it
[18:26:27] hmm, Reedy i see it at about the same, no?
[18:26:35] 9-9.5K / second
[18:26:48] 0.15-0.3K/s down
[18:26:54] But again, could just be background noise :)
[18:26:58] would be hard to notice a missing 400K / second
[18:27:04] it's a 5 minute average in those dashboards anyway
[18:27:26] Reedy: you think I should just touch and sync anyway? JUST IN CASE?!
[18:27:26] :)
[18:27:50] Haha
[18:27:53] It'd be interesting to see
[18:27:54] sorry not 400K/second
[18:27:57] 400/second
[18:28:00] sure why not ok.
[18:28:15] it's going up a bit more anyway
[18:30:23] grafana does show a slight further rise
[18:30:39] 0.1K-0.15K/s
[18:31:02] Certainly won't have harmed things
[18:32:44] aye
[18:32:45] thanks
[18:33:21] np
[18:39:22] PROBLEM - Host deployment-eventgate-analytics-1 is DOWN: CRITICAL - Host Unreachable (172.16.5.189)
[18:43:02] ^ is me
[18:43:05] i just deleted it...
[18:43:12] why is it pinging......
hm
[19:07:36] PROBLEM - Puppet staleness on deployment-eventlog05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0]
[19:37:33] (PS33) Thcipriani: Establish codehealth pipeline, enable for GrowthExperiments only [integration/config] - https://gerrit.wikimedia.org/r/502606 (https://phabricator.wikimedia.org/T218598) (owner: Kosta Harlan)
[19:43:05] (CR) Thcipriani: [C: +2] "Jobs deployed/working, merging to for zuul changes." [integration/config] - https://gerrit.wikimedia.org/r/502606 (https://phabricator.wikimedia.org/T218598) (owner: Kosta Harlan)
[19:47:36] !log reloading zuul to deploy https://gerrit.wikimedia.org/r/502606
[19:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[19:47:48] (Merged) jenkins-bot: Establish codehealth pipeline, enable for GrowthExperiments only [integration/config] - https://gerrit.wikimedia.org/r/502606 (https://phabricator.wikimedia.org/T218598) (owner: Kosta Harlan)
[20:10:18] (PS1) Kosta Harlan: Codehealth: Make patch job voting [integration/config] - https://gerrit.wikimedia.org/r/509930 (https://phabricator.wikimedia.org/T218598)
[20:14:48] Hey folks. I just lost my terminal connection during a scap deploy (derp). Should I restart the deployment process?
[20:17:27] for a scap3 deploy? I think that should work fine. It shouldn't try to redeploy to hosts it already deployed to (that is, it knows what's deployed and what you're trying to deploy).
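The reason restarting is safe, per the answer above, is that scap3 tracks what each target already has, so a re-run skips hosts that completed before the connection dropped. A toy Python illustration of that idea (this is not scap's actual implementation, just the per-host-state principle):

```python
# Toy illustration of why re-running an interrupted deploy is safe when
# the tool records per-host state: hosts already at the target revision
# are skipped. NOT scap's real code -- just the idea.

def deploy(targets, state, rev):
    """Deploy `rev` to each target not already on it; return hosts touched."""
    touched = []
    for host in targets:
        if state.get(host) == rev:
            continue  # finished before the interruption; nothing to redo
        state[host] = rev  # stand-in for actually shipping the revision
        touched.append(host)
    return touched

# First run died after reaching two of three hosts:
state = {"mw1": "abc123", "mw2": "abc123"}
redo = deploy(["mw1", "mw2", "mw3"], state, "abc123")
print(redo)  # only the host the first run never reached
```

Because the operation is idempotent per host, running the whole deploy again converges to the same end state as an uninterrupted run.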
[20:18:22] (CR) Thcipriani: [C: +2] Codehealth: Make patch job voting [integration/config] - https://gerrit.wikimedia.org/r/509930 (https://phabricator.wikimedia.org/T218598) (owner: Kosta Harlan)
[20:19:30] (Merged) jenkins-bot: Codehealth: Make patch job voting [integration/config] - https://gerrit.wikimedia.org/r/509930 (https://phabricator.wikimedia.org/T218598) (owner: Kosta Harlan)
[20:21:50] !log reloading zuul to deploy https://gerrit.wikimedia.org/r/509930
[20:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:36:51] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<20.00%)
[20:52:43] (PS1) Kosta Harlan: Codehealth pipeline: Ignore exit code of npm test:unit [integration/config] - https://gerrit.wikimedia.org/r/509943 (https://phabricator.wikimedia.org/T218598)
[20:57:10] (PS1) Kosta Harlan: Codehealth: Allow jobs to execute concurrently [integration/config] - https://gerrit.wikimedia.org/r/509944 (https://phabricator.wikimedia.org/T218598)
[21:07:44] Release-Engineering-Team (Kanban), Scap, User-zeljkofilipin: Problems deploying dblists/commonsuploads.dblist - https://phabricator.wikimedia.org/T217830 (Krinkle) I think the simpler solution would be to do the same as what we generally do to avoid this problem in production code, e.g. in MediaWiki cor...
[21:09:03] Release-Engineering-Team (Kanban), Scap, Performance-Team (Radar), User-zeljkofilipin: Problems deploying dblists/commonsuploads.dblist - https://phabricator.wikimedia.org/T217830 (Krinkle)
[21:09:27] Deployments, Release-Engineering-Team (Backlog), HHVM, Performance-Team (Radar), Wikimedia-Incident: Figure out why HHVM kept running stale code for hours - https://phabricator.wikimedia.org/T181833 (Krinkle)
[22:13:38] Why do we do postmerge jobs for l10n update?
[22:14:39] Continuous-Integration-Config, MediaWiki-extensions-Scribunto, Wikidata, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): [Task] Add Scribunto to extension-gate in CI - https://phabricator.wikimedia.org/T125050 (Addshore) Sounds okay to me!
[22:41:29] PROBLEM - Citoid on deployment-sca01 is CRITICAL: connect to address 172.16.5.13 and port 1970: Connection refused
[22:51:31] RECOVERY - Citoid on deployment-sca01 is OK: HTTP OK: HTTP/1.1 200 OK - 921 bytes in 0.026 second response time
[23:15:15] Phabricator, Developer-Advocacy (Apr-Jun 2019): Re-evaluate our use of Phabricator Conpherence chat - https://phabricator.wikimedia.org/T127640 (Peachey88) Could we restrict creating new rooms to just #trusted-contributors etc.? There are valid and useful rooms; Site-Requests seems to be heavily used and...
[23:26:35] Phabricator, Developer-Advocacy (Apr-Jun 2019): Re-evaluate our use of Phabricator Conpherence chat - https://phabricator.wikimedia.org/T127640 (greg) >>! In T127640#5178820, @Peachey88 wrote: > Could we restrict creating new rooms to just #trusted-contributors etc, There are valid and useful rooms, Site...
[23:51:19] Phabricator, Developer-Advocacy (Apr-Jun 2019): Re-evaluate our use of Phabricator Conpherence chat - https://phabricator.wikimedia.org/T127640 (Dzahn) > So far we (Security) have proposed a move to IRC but there are various blockers. > I think this is worth discussing more what can be done here. Maybe...