[06:21:24] DBA, Operations, ops-codfw: db2054: Disk with predictive failure - https://phabricator.wikimedia.org/T183887#3866270 (Marostegui)
[06:21:44] DBA, Operations, ops-codfw: db2054: Disk with predictive failure - https://phabricator.wikimedia.org/T183887#3866270 (Marostegui) p:Triage>Normal
[06:22:11] DBA, Wikimedia-Site-requests: Global rename of ولاء → لا روسا: supervision needed - https://phabricator.wikimedia.org/T183467#3866284 (Marostegui) I am around
[06:22:19] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866285 (Marostegui) I am around
[06:24:23] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866286 (alanajjar) @Marostegui Welcome Start both?
[06:26:55] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866287 (Marostegui) One at the time, whichever you want first
[06:27:46] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866288 (alanajjar) @Marostegui So we'll start from here
[06:28:16] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866289 (Marostegui) ok, please paste the progress URL here Thanks!
[06:28:32] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866290 (alanajjar) [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Bluemoon2999|The progress]]
[06:29:37] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866291 (Marostegui) Thanks!
[06:35:18] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866293 (alanajjar) @Marostegui We passed ar.wiki with 100,000 edits (Y)
[07:34:07] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866329 (alanajjar) @Marostegui Finished Thanks
[07:34:31] DBA, Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3866330 (alanajjar) Open>Resolved a:alanajjar
[07:35:51] DBA, Wikimedia-Site-requests: Global rename of ولاء → لا روسا: supervision needed - https://phabricator.wikimedia.org/T183467#3866333 (Marostegui) Let's go for this one then
[07:36:17] DBA, Wikimedia-Site-requests: Global rename of ولاء → لا روسا: supervision needed - https://phabricator.wikimedia.org/T183467#3866334 (alanajjar) @Marostegui I'll start
[07:38:42] DBA, Wikimedia-Site-requests: Global rename of ولاء → لا روسا: supervision needed - https://phabricator.wikimedia.org/T183467#3866336 (alanajjar) [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/ولاء|The progress]]
[07:54:56] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3866341 (Marostegui) I will alter s1 master tomorrow morning
[08:03:42] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3866346 (Marostegui) s7 eqiad hosts: [] labsdb1001.eqiad.wmnet (broken - will not get it) [] labsdb1003.eqiad.wmnet (replication...
[08:04:40] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3866347 (Marostegui)
[08:24:15] DBA, Wikimedia-Site-requests: Global rename of ولاء → لا روسا: supervision needed - https://phabricator.wikimedia.org/T183467#3866352 (alanajjar) @Marostegui Done. Thanks
[08:24:31] DBA, Wikimedia-Site-requests: Global rename of ولاء → لا روسا: supervision needed - https://phabricator.wikimedia.org/T183467#3866353 (alanajjar) Open>Resolved a:alanajjar
[08:41:50] DBA, Operations, ops-codfw: pc2005 crashed: CPU2 internal error - https://phabricator.wikimedia.org/T183750#3866368 (jcrespo) I think these servers are leased CC @RobH
[09:09:35] DBA, Data-Services: Create backups of user tables from decommissioned database servers - https://phabricator.wikimedia.org/T183758#3866454 (jcrespo) > 8 TB HDD should be enough and those are only $150. That is a very simplistic view, enterprise grade disks are way more expensive than that- without takin...
[09:45:49] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=5&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1010&var-port=9104&from=now-7d&to=now
[09:46:09] wow
[09:59:08] DBA, Analytics-Kanban: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3866567 (elukey) >>! In T168414#3866296, @Marostegui wrote: > I have not reviewed the full list of ALTERs, you have way more knowledge than me about what is needed on those tables :-) > But yes,...
[10:03:30] DBA, Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3866569 (Marostegui)
[10:14:13] DBA, Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3866606 (Marostegui)
[10:57:41] hello people
[10:58:10] I am wondering if a mc reset cold could help with dbstore1002's mgmt interface
[10:58:25] (from ipmitool on neodymium)
[10:58:39] We have seen in the past that sometimes we have to drain all the power to make the mgmt interface to work :(
[10:58:45] but it is worth a try of course
[10:59:08] let's see if we are lucky
[10:59:08] :D
[11:00:16] it doesn't seem so
[11:00:23] :(
[11:18:52] hi, is there one/two databases in eqiad I could run smartmontools on? I'm rolling out smart in eqiad similarly to what we did in https://gerrit.wikimedia.org/r/#/c/386603/
[11:20:03] godog: is it fully rolled out on codfw, or that is the latest update?
[11:20:34] jynus: fully rolled out in codfw
[11:20:44] oh, I see the changes now, sorry
[11:21:12] do you want one of it kind, or one of it kind that would be "safe"?
[11:21:43] or just a couple of host?
[11:21:54] a kind that's safe would be best yeah, possibly both hp and dell
[11:22:01] you say 1/2 so let me see
[11:22:21] we haven't seen any ill effects so far but I'd rather be safe than sorry :))
[11:22:41] db1039 is depooled and will be depooled forever, so that one can take it
[11:23:43] db1001 - megacli + ubuntu
[11:23:53] db1001 has a failed disk btw
[11:24:03] which means is a good test :-)
[11:24:36] yeah :)
[11:24:55] and one of the s8 replicas
[11:25:41] db1087 is HP
[11:25:50] and jessie
[11:26:35] ok! so 1001 / 1039 / 1087
[11:26:42] I'll send the review your way anyways
[11:26:50] let's throw a stretch there, too
[11:27:01] db1101
[11:27:05] actually, now that I think about it, db1039 will be decommissioned quite soon (I am finishing with it today actually)
[11:27:24] well, we can add it to hiera, and test it once
[11:27:31] then delete it :-)
[11:27:33] sure :)
[11:27:50] 1001 should fix the ubuntu quota
[11:29:21] DBA, Analytics-Kanban: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3866794 (elukey) Sent an email to announce maintenance for tomorrow (Jan 03).
[11:29:44] sweet, thanks marostegui and jynus !
[11:51:16] DBA, Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3866852 (Marostegui)
[11:51:48] DBA, Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3250598 (Marostegui) The first iteration of checks for the s7 databases is completed. I am going to go again and check it more using compare.py
[12:26:07] I set up db1055 and db1056 from scratch, purely logically
[12:26:44] so I broke some grants on my first deployment from a couple of minutes, on 10% of the x1 load
[12:27:04] on the bright side, we have now a clean "template" to work with
[12:34:49] DBA, Operations, hardware-requests, ops-eqiad, Patch-For-Review: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3573553 (MoritzMuehlenhoff) This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in https://servermon.wikimedia.org/hosts/)
[12:36:26] DBA, Operations, hardware-requests, ops-eqiad: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3588137 (MoritzMuehlenhoff) This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in https://servermon.wikimedia.org/hosts/)
[14:02:42] https://gerrit.wikimedia.org/r/401491 <- the review for smart
[14:06:25] instead of es1011 I would use es1015 (es1011 is a master)
[14:06:38] it should not be impacting I know, but better be safe than sorry :)
[15:26:27] DBA, MediaWiki-Special-pages, Wikidata: Rdbms error on Special:Tags on Wikidata - https://phabricator.wikimedia.org/T183921#3867312 (Mbch331)
[15:28:54] DBA, MediaWiki-Special-pages, Wikidata: Rdbms error on Special:Tags on Wikidata - https://phabricator.wikimedia.org/T183921#3867327 (Mbch331) Just checked nlwiki and there the page works, so it seems to be related to Wikidata only.
[15:33:29] DBA, MediaWiki-Special-pages, Wikidata: Rdbms error on Special:Tags on Wikidata - https://phabricator.wikimedia.org/T183921#3867312 (jcrespo) This seem to be: ``` ChangeTags::tagUsageStatistics 10.64.16.76 2062 Read timeout is reached (10.64.16.76) SELECT ct_tag,count(*) AS `hitcount` FROM `change_...
[15:43:06] DBA, Operations, ops-codfw: db2054: Disk with predictive failure - https://phabricator.wikimedia.org/T183887#3867388 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below...
[15:50:01] DBA, Operations, ops-codfw: pc2005 crashed: CPU2 internal error - https://phabricator.wikimedia.org/T183750#3867417 (RobH) Lease versus purchase has no change in warranty support, just in our tracking of hardware. This should be able to be processed as a normal under warranty server. (Leasing just...
[16:01:22] DBA, Operations, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3867470 (jcrespo)
[16:11:25] DBA, MediaWiki-Platform-Team, Structured-Data-Commons, Wikidata: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044#3867529 (CCicalese_WMF)
[16:39:56] DBA, MediaWiki-Database: Potential shard confussion on loadbalancer checking s1 lag on x1 hosts? Or just config outdated/fail? - https://phabricator.wikimedia.org/T183925#3867662 (jcrespo)
[16:43:55] 10.64.48.11:3314 may be missing some grants, are you working on db1097 at the moment?
[16:47:40] nope
[16:47:51] db1098:3317 is what I am working on now
[16:47:52] I will add them manually
[16:48:07] they are heartbeat checks from labswiki
[16:48:11] thanks
[16:55:25] doing the same for db1103
[17:11:53] DBA, Operations, ops-eqiad: Degraded RAID on db1001 - https://phabricator.wikimedia.org/T183708#3867855 (Cmjohnson) Disk Swapped
[17:12:53] DBA, Operations, ops-eqiad: Degraded RAID on db1001 - https://phabricator.wikimedia.org/T183708#3867877 (Marostegui) Thanks! ``` root@db1001:~# megacli -pdrbld -showprog -physdrv\[32:1\] -aALL Rebuild Progress on Device at Enclosure 32, Slot 1 Completed 3% in 1 Minutes. ```
[19:57:58] DBA, Data-Services, Goal, cloud-services-team (FY2017-18): Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3868965 (madhuvishy) p:Triage>Normal