[04:07:59] VPS-project-codesearch, User-DannyS712: Codesearch down (for a while) - https://phabricator.wikimedia.org/T267507 (DannyS712)
[06:49:56] VPS-project-codesearch, User-DannyS712: Codesearch down (for a while) - https://phabricator.wikimedia.org/T267507 (Ladsgroup) Technically codesearch is not down, gerrit-replica is down ` Nov 08 13:18:09 codesearch6 docker[8772]: fatal: unable to access 'https://gerrit-replica.wikimedia.org/r/mediawiki/co...
[07:30:39] VPS-project-codesearch, User-DannyS712: Codesearch down (for a while) - https://phabricator.wikimedia.org/T267507 (Ladsgroup) Open→Resolved a:elukey So the codesearch is back online due to restart of gerrit-replica by @elukey. We should have better alarm, etc. and also move it to eqiad (IMHO)...
[07:32:56] Continuous-Integration-Config: How to set up CI to have "PHP Parsoid" enabled? - https://phabricator.wikimedia.org/T267511 (Osnard)
[08:14:23] Gerrit, Release-Engineering-Team (Development services): Gerrit log4j complains when server is stopped - https://phabricator.wikimedia.org/T267516 (hashar)
[08:15:13] Gerrit: gerrit-replica was down for a while and no one noticed - https://phabricator.wikimedia.org/T267517 (Ladsgroup)
[08:23:44] Gerrit, Release-Engineering-Team (Development services): Gerrit log4j complains when server is stopped - https://phabricator.wikimedia.org/T267516 (hashar) Open→Declined Merely filed this one as a reference for later. Maybe it complains cause log4j got unconfigured too early in the shutdown pro...
[08:55:46] Gerrit, observability: gerrit-replica was down for a while and no one noticed - https://phabricator.wikimedia.org/T267517 (Peachey88)
[09:05:06] Gerrit, Release-Engineering-Team (Development services), observability: gerrit-replica was down for a while and no one noticed - https://phabricator.wikimedia.org/T267517 (hashar)
[09:08:36] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), git-protocol-v2: Gerrit out of heap - https://phabricator.wikimedia.org/T263008 (hashar) That happened on the Gerrit replica gerrit2001: ` Nov 07 13:43:35 gerrit2001 java[30197]...
[09:17:09] Gerrit, Release-Engineering-Team (Development services), observability: gerrit-replica was down for a while and no one noticed - https://phabricator.wikimedia.org/T267517 (hashar) It apparently went out of memory which is T263008. I have started it again as part of an unrelated routine maintenance t...
[09:24:59] Gerrit, Release-Engineering-Team (Development services), observability: gerrit-replica was down for a while and no one noticed - https://phabricator.wikimedia.org/T267517 (elukey) @hashar as FYI there was a systemd restart before yours, but an OOM is not enough to kill a jvm (you'd need something lik...
[09:25:27] Gerrit, Release-Engineering-Team (Development services), observability: gerrit-replica was down for a while and no one noticed - https://phabricator.wikimedia.org/T267517 (hashar) a:hashar That is the exact same issue we had with ElasticSearch at T76090 . We need to pass `-XX:+ExitOnOutOfMemoryEr...
[09:29:06] Gerrit, Release-Engineering-Team (Development services), observability, Patch-For-Review: gerrit-replica was down for a while and no one noticed - https://phabricator.wikimedia.org/T267517 (hashar) >>! In T267517#6612165, @elukey wrote: > @hashar as FYI there was a systemd restart before yours, b...
[10:06:46] Gerrit, Release-Engineering-Team (Development services), observability, Patch-For-Review: gerrit-replica was down for a while and no one noticed - https://phabricator.wikimedia.org/T267517 (hashar) Open→Resolved Should be good now.
[10:49:58] VPS-project-codesearch, User-DannyS712: Codesearch down (for a while) - https://phabricator.wikimedia.org/T267507 (Nikerabbit) Umm, I am not seeing results that I am expecting to see: https://codesearch.wmcloud.org/search/?q=compactlinks&i=nope&files=&repos= is this related?
[10:51:57] VPS-project-codesearch: Gerrit repositories missing from codesearch - https://phabricator.wikimedia.org/T267533 (Nikerabbit)
[10:52:34] also Phab is slow...
[10:52:38] VPS-project-codesearch, User-DannyS712: Codesearch down (for a while) - https://phabricator.wikimedia.org/T267507 (Nikerabbit) Filed {T267533} for clarity.
[11:27:53] VPS-project-codesearch, User-Ladsgroup: Gerrit repositories missing from codesearch - https://phabricator.wikimedia.org/T267533 (Ladsgroup) Open→Resolved a:Ladsgroup I think gerrit-replica going down corrupted the data. Rewrote the config (by running python write_config.py) and restarted code...
[11:31:12] thanks Amir1
[11:31:46] Nikerabbit: thank you for noticing it!
[12:30:21] Gerrit, Wikimedia-GitHub, Patch-For-Review: mediawiki/extensions/WSOAuth Github and Gerrit repo have diverged - https://phabricator.wikimedia.org/T263955 (Aklapper) @Xxmarijnw: Would you have time to follow the recommendations above?
[12:57:45] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: en.wikipedia.beta.wmflabs.org down - https://phabricator.wikimedia.org/T267551 (zeljkofilipin)
[12:58:22] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, User-zeljkofilipin: en.wikipedia.beta.wmflabs.org down - https://phabricator.wikimedia.org/T267551 (zeljkofilipin) p:Triage→High I'm not sure if this is UBN, but it's at least high.
😬
[13:14:36] Scap, Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (Gehel) Open→Resolved
[13:58:35] (PS2) Hashar: Upgrade javamelody from 1.83.0 to 1.86.0 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/630551 (https://phabricator.wikimedia.org/T232678)
[13:58:53] Beta-Cluster-Infrastructure: Puppet failures on many hosts - https://phabricator.wikimedia.org/T267006 (Gilles)
[14:01:51] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, User-zeljkofilipin: en.wikipedia.beta.wmflabs.org down - https://phabricator.wikimedia.org/T267551 (Gilles) Likely caused by T267006
[14:03:48] (CR) Hashar: [V: +2 C: +2] "Upgraded to latest master which brings in JavaMelody 1.86.0" [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/630551 (https://phabricator.wikimedia.org/T232678) (owner: Hashar)
[14:04:46] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), Patch-For-Review: Update JavaMelody on Gerrit to 1.86.0 - https://phabricator.wikimedia.org/T232678 (hashar)
[14:18:06] !log deployment-prep: removing role::beta::availability_collector from deployment-cache06 since that broke puppet.
The role no longer exists # T267006
[14:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:18:12] T267006: Puppet failures on many hosts - https://phabricator.wikimedia.org/T267006
[14:19:03] Beta-Cluster-Infrastructure: Puppet failures on many hosts - https://phabricator.wikimedia.org/T267006 (hashar)
[14:40:19] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Update JavaMelody on Gerrit to 1.86.0 - https://phabricator.wikimedia.org/T232678 (hashar) Open→Resolved
[14:46:52] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, User-zeljkofilipin: Beta Cluster down - https://phabricator.wikimedia.org/T267551 (Nintendofan885)
[14:47:44] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, User-zeljkofilipin: Beta Cluster is down - https://phabricator.wikimedia.org/T267551 (Nintendofan885)
[14:48:05] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, User-zeljkofilipin: Beta Cluster is down - https://phabricator.wikimedia.org/T267551 (Nintendofan885) It's affecting the other beta.wmflabs.org subdomains as well
[14:48:20] zeljkof: the last time beta went down I think it was UBN fwiw
[14:52:00] Not like marking it UBN will really make much difference
[14:53:15] maybe not for getting it fixed, but later someone could say that we had this many UBNs....
[14:53:20] True, anyone who knows what to do is probably aware anyway
[14:58:08] Beta-Cluster-Infrastructure, Operations, Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (Gilles)
[14:59:19] Beta-Cluster-Infrastructure, Beta-Cluster-reproducible, User-zeljkofilipin: Beta Cluster is down - https://phabricator.wikimedia.org/T267551 (Gilles)
[15:00:28] Beta-Cluster-Infrastructure, Operations, Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (Gilles)
[15:01:10] Beta-Cluster-Infrastructure, Operations, Traffic: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (Gilles)
[15:06:28] RhinosF1: I was not sure if we have rules what is an UBN, I know production problems are, but I'm not sure about beta
[15:06:50] It was more an observation
[15:06:57] Not a clue what the actual rule is
[15:13:03] I thought it's better to save UBNs for real emergencies :)
[15:13:20] beta down is a problem, but I'm not sure if it's an emergency
[15:13:44] I think it is, because it causes problems for me, but then I'm frequently wrong :D
[15:24:01] (PS1) Hashar: Add metrics-reporter-prometheus plugin [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/640174 (https://phabricator.wikimedia.org/T184086)
[15:54:45] Phabricator, Project-Admins: Create a project and a Space for the Basque Wikimedians User Group - https://phabricator.wikimedia.org/T175287 (Aklapper) >>! In T175287#6317298, @Aklapper wrote: > * [Some tasks](https://phabricator.wikimedia.org/maniphest/?ids=176659,178676,178680,180592,180593,180594,18059...
[16:07:01] Release-Engineering-Team, serviceops, Patch-For-Review: replace production deployment servers - https://phabricator.wikimedia.org/T265963 (Papaul)
[16:20:46] (PS1) Hashar: Add metrics-reporter-jmx plugin [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/640177 (https://phabricator.wikimedia.org/T184086)
[16:22:25] Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), MW-on-K8s: Experiment with generating json config - https://phabricator.wikimedia.org/T267057 (dancy) Execution of wmf-config/CommonSettings.php requires an MW installation.
[16:43:33] (CR) Hashar: [C: -1] "The metrics do show on the JavaMelody Mbeans tree , but they are not exposed in /monitoring?format=prometheus T184086#6613887" [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/640177 (https://phabricator.wikimedia.org/T184086) (owner: Hashar)
[16:52:50] Release-Engineering-Team, WVUI, Vue.js (Vue.js-Search): Create service account for npm - https://phabricator.wikimedia.org/T267280 (nnikkhoui)
[16:52:58] Release-Engineering-Team, WVUI, Vue.js (Vue.js-Search): Create service account for npm - https://phabricator.wikimedia.org/T267280 (nnikkhoui) Tagging Release Engineering Team, wondering if there's a best way to do this, if such an account already exists, or if i can just make one : )
[17:21:43] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, Operations, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar) I then tried **`metrics-reporter-prometheus`**. The metrics are exposed under the plugin namesp...
[17:30:28] (CR) Hashar: "The Gerrit internal metrics are then exposed on https://gerrit.wikimedia.org/r/monitoring?part=mbeans which is helpful for administrators " [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/640177 (https://phabricator.wikimedia.org/T184086) (owner: Hashar)
[17:32:50] (CR) Hashar: "That will expose Gerrit internal metrics at https://gerrit.wikimedia.org/r/plugins/metrics-reporter-prometheus/metrics . Once deployed, w" [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/640174 (https://phabricator.wikimedia.org/T184086) (owner: Hashar)
[17:46:05] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, Operations, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar) a:hashar
[17:46:29] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), git-protocol-v2: Gerrit out of heap - https://phabricator.wikimedia.org/T263008 (hashar)
[17:46:38] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, Operations, and 2 others: Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar)
[17:47:08] Release-Engineering-Team-TODO, WVUI, Vue.js (Vue.js-Search): Create service account for npm - https://phabricator.wikimedia.org/T267280 (thcipriani)
[17:47:26] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), git-protocol-v2: Gerrit out of heap - https://phabricator.wikimedia.org/T263008 (hashar) Looks like I forgot to flag monitoring as a requirement to solve this one: T184086 . Wi...
[17:53:32] Release-Engineering-Team-TODO, WVUI, Vue.js (Vue.js-Search): Create service account for npm - https://phabricator.wikimedia.org/T267280 (thcipriani) >>!
In T267280#6613923, @nnikkhoui wrote: > Tagging Release Engineering Team, wondering if there's a best way to do this, if such an account already exi...
[17:55:26] (CR) Ahmon Dancy: [C: -1] "The following sequence fails:" (2 comments) [tools/scap] - https://gerrit.wikimedia.org/r/634929 (owner: Lars Wirzenius)
[18:13:45] Phabricator, Developer-Advocacy (Jan-Mar 2021), Epic, Patch-For-Review: (Semi)automatically close Phabricator tickets with status "stalled" after a while - https://phabricator.wikimedia.org/T252522 (Aklapper) > 100 stalled tasks for more than 36 months, 137 for more than 24 months, 279 for more t...
[19:39:46] (PS1) Ahmon Dancy: Allow the default l10n language to be controlled by SCAP_MW_LANG [tools/scap] - https://gerrit.wikimedia.org/r/640231
[19:40:48] (CR) DannyS712: Allow the default l10n language to be controlled by SCAP_MW_LANG (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/640231 (owner: Ahmon Dancy)
[19:43:51] (PS2) Ahmon Dancy: Allow default l10n language to be set by SCAP_MW_LANG environment var [tools/scap] - https://gerrit.wikimedia.org/r/640231
[19:44:14] (CR) Ahmon Dancy: Allow default l10n language to be set by SCAP_MW_LANG environment var (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/640231 (owner: Ahmon Dancy)
[19:51:04] (CR) DannyS712: Allow default l10n language to be set by SCAP_MW_LANG environment var (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/640231 (owner: Ahmon Dancy)
[19:54:55] Gerrit, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): check gerrit-replica healthcheck alerts - https://phabricator.wikimedia.org/T247186 (thcipriani) a:thcipriani→None
[20:19:44] hi! any thoughts on this patch, anyone?
https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/637082
[20:52:20] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, serviceops, Patch-For-Review: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (jijiki) There are a couple of things that we will need to keep in mind: * **mcrouter...
[20:56:52] Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, serviceops, Patch-For-Review: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (jijiki)
[21:03:22] Phabricator, Release-Engineering-Team: Phab email account verification not received - https://phabricator.wikimedia.org/T267590 (Reedy)
[21:06:59] Phabricator, Release-Engineering-Team: Phab email account verification not received - https://phabricator.wikimedia.org/T267590 (Reedy) I just had a quick poke... It seems on your "MediaWiki Account" (ala `JCross (WMF)`) it's still on your `-ctr` email. Can you change it on https://www.mediawiki.org/wik...
[21:13:10] Phabricator: Phab email account verification not received - https://phabricator.wikimedia.org/T267590 (Aklapper) Email addresses in Phab are separate from wiki stuff. Phab DB shows both email addresses for @jcross' account with the old address set as primary. @jcross: What's shown on https://phabricator.wiki...
[21:29:15] Phabricator: Phab email account verification not received - https://phabricator.wikimedia.org/T267590 (Jcross) @Aklapper I am seeing both -ctr as primary and jcross as added but needing verification. I've clicked "verify" several times over several days and never received anything.
[21:36:29] (PS1) Ahmon Dancy: Test case for 640231 [tools/train-dev] - https://gerrit.wikimedia.org/r/640253
[21:37:07] (CR) Ahmon Dancy: "Can be tested with https://gerrit.wikimedia.org/r/c/mediawiki/tools/train-dev/+/640253" [tools/scap] - https://gerrit.wikimedia.org/r/640231 (owner: Ahmon Dancy)
[22:26:20] Phabricator, Mail: Phab email account verification not received - https://phabricator.wikimedia.org/T267590 (Aklapper)
[22:33:25] Release-Engineering-Team-TODO, Patch-For-Review, Release, Train Deployments, User-brennen: 1.36.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T263182 (brennen)
[22:51:43] Release-Engineering-Team-TODO, Patch-For-Review, Release, Train Deployments, User-brennen: 1.36.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T263182 (brennen) Current status: Deployed to all wikis and expected to stay there, with the footnote that @cscott has some patches...