[06:36:31] DBA, Wikimedia-General-or-Unknown, Security: Move private wikis to a dedicated cluster - https://phabricator.wikimedia.org/T101915 (Marostegui) Unfortunately, what was said on 2015 (T101915#1351273) no longer applies. This would be a complicate effort nowadays (see how difficult and risky is to move...
[06:39:56] DBA, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (Marostegui) @jijiki reading the task it is not clear to me what's needed from us (#dba). Is it a heads up that you'll be running `updateCollation.php` against the wikis listed on T264991#...
[06:46:50] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) pc1 eqiad: [x] pc1010 [] pc1007
[06:47:16] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[06:57:11] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[07:14:20] DBA, Operations, Orchestrator, Patch-For-Review: orchestrator: Use ssl for talking to db servers - https://phabricator.wikimedia.org/T267401 (Marostegui) p:Triage→Medium
[07:14:27] Blocked-on-schema-change, DBA: Drop default of ip_changes.ipc_rev_timestamp - https://phabricator.wikimedia.org/T267399 (Marostegui) p:Triage→Medium
[07:16:23] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) Stalled→Open a:Marostegui
[07:21:14] DBA, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki) @Marostegui yes, it is a headsup for your radar, thank you!
[07:22:22] DBA, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (Marostegui) Thank you! What's the expected impact of `updateCollation.php`?
[07:51:35] did you see my notes on the SRE meeting doc?
[08:09:46] DBA, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) One other sanity check for the rollout (in particular when the whole server batch gets upgraded on the 16th); ` cumin foo* 'php -r "var_dump(IntlChar::getUnicodeVersi...
[08:31:47] DBA, CheckUser: Monitor the growth of CheckUser tables at enwiki and few other very large wikis - https://phabricator.wikimedia.org/T267275 (Marostegui)
[08:36:34] DBA, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff)
[09:32:13] DBA, CheckUser: Monitor the growth of CheckUser tables at enwiki and few other very large wikis - https://phabricator.wikimedia.org/T267275 (Marostegui)
[09:32:29] DBA, CheckUser: Monitor the growth of CheckUser tables at enwiki and few other very large wikis - https://phabricator.wikimedia.org/T267275 (Marostegui)
[10:54:03] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) I will updating to icu63 in deployment-prep, with Moritz looking on. This will likely happen later today, and I'll post updates about the prog...
[12:11:14] Blocked-on-schema-change, DBA: Drop default of ip_changes.ipc_rev_timestamp - https://phabricator.wikimedia.org/T267399 (Marostegui) It can be done online, so no master failover required: ` MariaDB [test]> ALTER TABLE /*_*/ip_changes ALTER COLUMN ipc_rev_timestamp DROP DEFAULT, ALGORITHM=INPLACE, LOCK=NO...
[12:14:46] Blocked-on-schema-change, DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (Marostegui) As this modifies datatypes, this would require a master failover for each master.
[12:19:13] Blocked-on-schema-change, DBA: Drop default of protected_titles.pt_expiry - https://phabricator.wikimedia.org/T267335 (Marostegui) It can be done ONLINE, so no master failover is required ` MariaDB [test]> ALTER TABLE /*_*/protected_titles ALTER pt_expiry DROP DEFAULT, ALGORITHM=INPLACE, LOCK=NONE; Query...
[12:36:04] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) Expanded the PV on each host: ` ===== NODE GROUP ===== (8) clouddb[1013-1020].eqiad.wmnet ----- OUTPUT of 'sudo df -hT /srv ; pvs' ----- Filesystem Type...
[13:24:39] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) m1 eqiad: [x] db1117 [] db1080 m2 eqiad: [x] db1117 [] db1107 m3 eqiad: [x] db1117 [] db1132 m5 eqiad: [x] db1117 [] db1128
[13:24:41] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[13:26:39] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[13:29:31] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[13:43:26] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[14:30:26] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (Reedy) >>!
In T264991#6615421, @Marostegui wrote: > Thank you! What's the expected impact of `updateCollation.php`? Many many categorylinks rows being up...
[14:30:39] lol, that caused me to be pinged in 4 channels
[14:34:35] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki) >>! In T264991#6615421, @Marostegui wrote: > Thank you! What's the expected impact of `updateCollation.php`? We will run on one wiki and see how...
[14:40:47] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (Marostegui) Thank you @Reedy!
[14:40:49] Reedy: ^ there you go another 4 pings
[14:41:03] Doesn't that make it 5?
[14:41:14] Reedy: 6 now
[14:55:51] DBA, Orchestrator: Investigate using orchestrator tags for different type of hosts. - https://phabricator.wikimedia.org/T266869 (Marostegui)
[15:28:35] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (Papaul)
[15:41:32] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Upgrade plan on deployment-prep: [] add profile::mediawiki::php::icu63: true to hiera for deployment-prep project prefix; this will only have...
[15:42:32] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) Sounds good!
[16:19:19] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Note that I'm running cumin 'O{project:deployment-prep name:^deployment-mediawiki-[0-9]+$ } or O{project:deployment-prep name:^deployment...
[17:12:05] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Can't proceed at the moment, puppet sync to deployment-prep has been broken since Nov 6. Log excerpts from the earliest error: ` 2020-11-06T20...
[17:23:27] jynus around?
[17:25:44] yes
[17:26:05] db1139?
[17:26:09] cmjohnson1: ^
[17:26:36] yes, I would like to do that today if I can since we're not here the rest of the week
[17:27:03] ok, although if it doesn't work, we would like it back up today, ok, for the same reason :-D
[17:27:11] yes
[17:27:20] should only be down for 5 minutes
[17:27:21] thank, give me 5, sorry
[17:27:29] take your time
[17:27:33] I was caught unprepared :-D
[17:27:58] I just realized it, the task is assigned to John
[17:28:15] I think I didn't touch it after reopening
[17:28:29] as he did the mainboard
[17:28:38] so whoever can help
[17:28:52] yep, here for you!
[17:29:08] 90% chance is just reaseating
[17:29:13] 10% it is bad stick
[17:29:18] *reseating
[17:29:19] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) @Jclark-ctr: Has the defective HP mainboard been sent back to HP yet? They are spamming my inbox about it =]
[17:29:25] that is what I am thinking
[17:29:48] dbs have lots of memory!
[17:33:22] cmjohnson1: it is shutting down now, I will comment on the ticket for coordination
[17:33:32] awesome! Thanks
[17:34:20] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) @Cmjohnson host should be shut down right now after stopping mysql cleanly- you are free to disconnect/open/check ram now. Thank you!
[17:34:35] thanks to you cmjohnson always super helpful!!!!
[17:34:54] thanks
[17:39:59] jynus powering up..I reseated all the DIMM
[17:40:23] * jynus crossing fingers
[17:41:05] check if message on post or I can also check on os
[17:42:55] I did login to console..it's too late now unless I can reboot it
[17:43:44] *did not
[17:43:49] mm, os didn't came back
[17:44:50] it booted now
[17:45:14] Mem: 418922 :-(
[17:46:23] in fact, that is less than before!
[17:46:41] It used to report Mem: 483434
[17:48:06] hrm...I can try reseating again
[17:48:20] there are a lot of sticks
[17:48:26] wait
[17:48:31] I am looking at ILO
[17:48:52] Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000026, Bank 0x0000000D, Status 0xEE0007C0'001000C0, Address 0x00000040'4088B4C0, Misc 0x12294F78'D3D4C086).
[17:49:01] Uncorrectable Memory Error - The failed memory module could not be determined.
[17:49:10] Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 6). The DIMM is mapped out and is currently not available.
[17:49:20] is that the same one?
[17:49:27] yes
[17:50:13] if you want to try again, but it looks suspicious
[17:50:26] yeah, I want to move it to the other CPU
[17:50:38] ok, then, shutting it down again
[17:51:17] cmjohnson1: go ahead
[17:54:35] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) This is apparently from T267439 After some discussion with jbond and dancy in irc, I am going to revert that and hope I'm not making the varn...
[17:57:39] jynus: swapped DIMM 6 (cpu1) with DIMM 6 (cpu2). Let's see if it corrects, follows or stays the same
[18:00:12] DIMM Initialization Error - Processor 2 Channel 1
[18:00:36] wait, same processor different channel?
[18:02:03] channel I think is the same, DIMM module is what I am looking for. But the fact the error stayed on processor 2 suggest a new CPU is needed not the DIMM
[18:02:07] Mem: 451178
[18:02:34] which is a random number every time?
[18:02:51] not sure about memory, because the log error message is different now
[18:03:15] Before it said: Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 6). The DIMM is mapped out and is currently not available.
[18:03:27] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Puppet sync back to working. Back on track to continue with the update in deployment-prep.
[18:03:46] and now: DIMM Initialization Error - Processor 2 Channel 1. The identified memory channel could not be properly trained and has been mapped out.
[18:04:51] could also be a bad board
[18:05:03] but it doesn't say DIMM number explicitly
[18:05:09] but we just changed it!
[18:05:19] the boards they send are refurbished
[18:05:28] not the first time a bad board has shown up
[18:05:29] yeah, it was mostly a
[18:05:38] cry than a rational thought
[18:06:06] so I will update the ticket and we will call vendor again, I guess?
[18:06:31] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (Cmjohnson) reseated all of the DIMM, the erorr remained the same Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000026, Bank 0x000000...
[18:06:38] oh, you did
[18:06:44] I updated some things but feel free to add more details if needed
[18:06:58] thats good enough summary
[18:07:01] sorry for this
[18:07:32] I will put it back in production until next week
[18:07:41] see what are our options
[18:08:03] if you are ok with that, cmjohnson1
[18:08:50] sounds good! Thanks and have a great mini-holiday
[18:09:12] same to you, cmjohnson1 have some deserved rest, you and your team!
[18:10:07] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) :-( I will put db1139 back into production so it is somewhat useful until next week.
[18:11:55] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo)
[18:19:49] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) The memory view tells us that it is now 2, not 1 memory slot that is affected (of course, given the above test it is more likely it is CPU / b...
[18:38:37] DBA, Beta-Cluster-Infrastructure, Operations, serviceops: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (ArielGlenn) Update in deployment-prep is now complete, assuming I did not miss any hosts. ` root@deployment-cumin:~# cumin 'O{project:deployment-prep...
[19:01:00] there is increasing lag on pc2010
[19:38:02] DBA, GrowthExperiments, Growth-Team (Current Sprint), Patch-For-Review, and 2 others: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216 (kostajh) I'm moving this into the QA column for us to keep an eye on over the next couple weeks. Alternatively or in ad...
[22:31:50] DBA, GrowthExperiments, Growth-Team (Current Sprint), Patch-For-Review, and 2 others: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216 (Tgr) Not much point in monitoring if we don't fix it. IMO this should go back to Needs more work. Or do you mean that i...