[05:15:13] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1065 - https://phabricator.wikimedia.org/T196490#4259845 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID...
[05:16:07] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4259848 (10Marostegui) 05Open>03Resolved After replacing disk #1, this is all good now. ``` root@db1065:~# megacli -LDPDInfo -aAll | grep -i flagged Drive has flagged a S.M.A.R.T alert...
[06:03:01] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review, 10Schema-change: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316#4259876 (10Marostegui)
[06:03:30] 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4259877 (10Marostegui)
[06:03:37] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Patch-For-Review, 10Wikidata-Ministry-Of-Magic: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193#4259880 (10Marostegui)
[06:04:25] 10DBA, 10Operations, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4259884 (10Marostegui)
[06:04:30] 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4154116 (10Marostegui)
[06:04:45] 10DBA, 10Operations, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (10Marostegui)
[06:04:51] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review, 10Schema-change: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316#4101112 (10Marostegui)
[07:11:14] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4259933 (10jcrespo) @RobH Can you check if we have next-business-day support for defects with this hw provider and purchase? Because they seem to not be honoring that, or adding some on-purpose delay.
[07:20:04] did you already restart labsdb1010?
[07:20:10] yep
[07:20:22] I was going to suggest installing the microcode
[07:20:49] labsdb1011 is depooled, I wanted to move it to the new sanitariums in a bit, if you are around
[07:21:28] maybe we can do it there?
[07:21:58] sure :)
[07:30:30] Going to do so and reboot
[07:37:23] so does the package have the same name?
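A minimal sketch of the RAID health checks quoted in the two resolved tickets above, assuming the MegaCli CLI shown in the pasted output; adapter numbering and exact output strings vary by controller and firmware:
```
#!/bin/bash
# Virtual-drive summary (the "Number of Virtual Disks" /
# "Virtual Drive: 0" block pasted in T196490 above):
megacli -LDInfo -Lall -aAll

# Look for disks that have raised S.M.A.R.T. alerts, as done to verify
# the disk #1 replacement in T195444 above:
megacli -LDPDInfo -aAll | grep -i flagged

# Quick view of per-disk state (e.g. "Online, Spun Up" vs "Rebuild"):
megacli -PDList -aAll | grep -i 'firmware state'
```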
[07:37:34] as we were told
[07:41:28] yeah
[07:41:39] apt-get install intel-microcode
[07:41:52] labsdb1011 is back, I am going to start the pre-steps for moving it under the new sanitariums
[07:41:59] we need a way to check it is enabled, though
[07:42:04] moritzm: ^
[07:43:04] what I've been doing for the current canaries is to record the microcode version before and after the reboot:
[07:43:32] "grep microcode /proc/cpuinfo | uniq" essentially
[07:43:53] yeah
[07:43:57] I was seeing that in dmesg
[07:43:59] there's also WIP to track pending microcode reboots in https://gerrit.wikimedia.org/r/#/c/436553/
[07:44:00] thanks
[07:44:02] it is different on 1011 and 1010
[07:44:09] that's not uncommon
[07:44:19] I had seen that for other clusters as well
[07:44:36] basically, when Intel makes a CPU they add the current microcode release on die
[07:44:39] What I mean is that after the reboot it is different
[07:44:56] ah, that's good then :-)
[07:45:04] so labsdb1010 and 1009 have the same one, and 1011 (just rebooted) had a different one
[07:45:13] but I also noticed differences in existing clusters before the reboot, for the reason above
[07:45:39] and also for CPU replacements: if they replace a CPU under warranty, it's usually replaced with a model which has the more recent microcode on die
[07:45:57] microcode: microcode updated early to revision 0x3c, date = 2018-01-19
[07:46:06] yeah, that is what dmesg shows
[07:46:27] and it is different in labsdb1010, so I think it was picked up
[07:46:59] I guess 0x3c > 0x35 ?
[07:50:19] sounds good, once the prometheus exporter is live we can also track this a little more systematically
[07:54:10] jynus: I am all set to start moving labsdb1011
[07:54:13] all the pre-steps are done
[08:03:47] replication started on s5 for testing now
[08:05:57] All threads started
[08:06:00] and looking good so far
[08:13:04] we are having lots of errors on geodata
[08:13:21] where are you seeing that?
[08:13:26] Ah I see
[08:15:43] There were some related patches deployed on the train last night
[08:15:49] https://www.mediawiki.org/wiki/MediaWiki_1.32/wmf.7#GeoData
[08:26:30] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4260035 (10Marostegui) labsdb1011 has been moved over the new sanitarium. This was the last host to be moved. Let's wait to mak...
[08:26:48] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4260036 (10Marostegui)
[08:27:05] marostegui: https://phabricator.wikimedia.org/T196526
[08:28:42] how did you find it could be related to locator-tool?
[08:28:57] see the error, it is the referrer
[08:29:30] Aaah :)
[08:29:32] :)
[08:31:04] do you have any thoughts about the usage of the new servers?
[08:31:49] the old/temp sanitariums?
[08:31:57] yes
[08:32:04] yeah, I was thinking about it yesterday
[08:32:14] I wanted to check what needs to be replaced in core, if anything
[08:32:22] especially dumps and all that
[08:32:27] in theory 2 are for backup testing
[08:32:32] no
[08:32:43] backup generation, I think
[08:33:27] yeah, true
[08:33:39] how many do we have to set up in total?
[08:33:46] physical hosts
[08:33:50] 4
[08:34:00] you mean how many free up?
[08:34:04] yes
[08:34:17] but where did we steal the temporary ones from?
[08:34:31] and that is in eqiad?
[08:34:41] what about codfw?
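A short sketch of the before/after microcode check described above; the commands come from the log itself, and the revision values are the ones reported for labsdb1011:
```
#!/bin/bash
# Record the running microcode revision (run before and after the
# reboot and compare, as done for the canaries above):
grep microcode /proc/cpuinfo | uniq

# The kernel logs the early update at boot; on labsdb1011 this showed:
#   microcode: microcode updated early to revision 0x3c, date = 2018-01-19
dmesg | grep -i microcode

# Revisions are hex integers, so they compare directly (0x3c > 0x35
# confirms the newer blob was picked up after the reboot):
printf '0x3c = %d, 0x35 = %d\n' 0x3c 0x35
```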
[08:34:43] yeah, we have 4 that will be freed up in eqiad
[08:34:56] the codfw ones were core hosts, which are back in their original position
[08:35:04] already? thanks!
[08:35:13] yeah, they are back in service
[08:35:35] and where did we get the 2 temporary ones?
[08:35:46] because I remember not having a lot of room
[08:35:54] in eqiad?
[08:35:57] yes
[08:36:07] I think either dumps or backup generation
[08:36:25] yeah, they were new ones for that, right?
[08:36:31] I think so, yep
[08:37:21] maybe we can move the dbstore1001 s1 instance
[08:37:36] and set up some more, keeping dbstore1001 (for now) only as a disk
[08:38:08] you mean convert two of those to "dbstore"?
[08:38:13] Like replicating a few instances
[08:38:14] not really
[08:38:21] it is more of a
[08:38:28] backup generation host
[08:38:37] but not that there is a big difference
[08:38:42] yeah, I mean, replicating basically
[08:38:45] yeah
[08:38:49] like building them as multi-instance
[08:38:53] yeah
[08:39:03] sure, that sounds like the plan we had for them
[08:39:04] there will not be a lot of difference
[08:39:08] yeah
[08:39:12] with setting up dumps there
[08:39:21] dump mediawiki hosts
[08:39:43] yeah
[08:39:59] the main difference is that those can be delayed
[08:40:04] and maybe stopped
[08:40:10] indeed
[08:40:30] so db1120 is now set as spare, and I will do the same with db1116 today (T196376)
[08:40:31] T196376: Productionize old/temporary eqiad sanitariums - https://phabricator.wikimedia.org/T196376
[08:40:38] for db1095 and db1102 I will probably wait till next week
[08:40:45] yeah, makes sense
[08:40:49] or even longer if needed
[08:41:07] yeah
[08:41:31] we will have to do a lot of cleanup afterwards
[08:41:34] on puppet
[08:42:04] I wanted to try to do something today about replication changes
[08:43:31] <3 <3 <3
[08:46:23] 10DBA: Clean up sanitarium_multisource related code - https://phabricator.wikimedia.org/T196527#4260090 (10Marostegui) p:05Triage>03Normal
[08:46:57] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4241885 (10Marostegui)
[08:47:00] 10DBA: Clean up sanitarium_multisource related code - https://phabricator.wikimedia.org/T196527#4260090 (10Marostegui) 05Open>03stalled Stalling this task as db1095 is still not a spare
[08:53:01] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4209553 (10Addshore) Thanks!
[09:48:12] 10DBA, 10Patch-For-Review: Productionize old/temporary eqiad sanitariums - https://phabricator.wikimedia.org/T196376#4260378 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1116.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-re...
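The "can be delayed and maybe stopped" property mentioned above is what distinguishes these backup-generation replicas from regular ones. A hypothetical sketch: the host and delay value are illustrative, and native `MASTER_DELAY` requires MariaDB >= 10.2.3 or MySQL >= 5.6 (older versions would need an external tool such as pt-slave-delay):
```
#!/bin/bash
# Hypothetical: lag a backup-generation replica 24h behind its master,
# so bad changes can be caught before they replicate into the backups.
mysql -h db1116.eqiad.wmnet -e "
  STOP SLAVE;
  CHANGE MASTER TO MASTER_DELAY = 86400;
  START SLAVE;
"
```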
[10:02:26] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review, 10Schema-change: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316#4260427 (10Marostegui)
[10:02:43] 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4260428 (10Marostegui)
[10:02:57] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Patch-For-Review, 10Wikidata-Ministry-Of-Magic: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193#4260429 (10Marostegui)
[10:03:02] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4260430 (10Marostegui)
[10:05:20] 10DBA, 10Patch-For-Review: Productionize old/temporary eqiad sanitariums - https://phabricator.wikimedia.org/T196376#4260437 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1116.eqiad.wmnet'] ``` and were **ALL** successful.
[10:07:11] 10DBA, 10Patch-For-Review: Productionize old/temporary eqiad sanitariums - https://phabricator.wikimedia.org/T196376#4260441 (10Marostegui)
[10:08:05] when do you think we could do the m2 and m3 master switchovers?
[10:08:23] want to do it tomorrow?
[10:08:25] morning?
[10:08:31] or after lunch today?
[10:08:51] not sure
[10:08:59] for m2, for debmonitor, nothing to worry about, feel free anytime
[10:09:20] I would prefer to have alex and someone from releng, respectively
[10:09:29] and more people to check things are not broken
[10:10:24] I think I already got the right way to do it (kill connections after the switch), but some services may need a restart or something
[10:10:40] sure, maybe send an email to get people's attention? and schedule a day?
[10:11:29] I will ask individual people today and see if they can tomorrow
[10:11:36] sure :)
[10:11:54] we need some preparation
[10:12:05] do we still have the checklist from last time?
[10:12:30] I think so, let me see
[10:14:10] https://wikitech.wikimedia.org/wiki/MariaDB/misc#Example_Failover_process
[10:21:27] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review, 10Schema-change: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316#4260469 (10Marostegui) [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [] db1125 [] dbstore1002 [] db1085 [] db1088...
[10:21:30] 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4260470 (10Marostegui) [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [] db1125 [] dbstore1002 [] db1085 [] db1088 [] db1093 []...
[10:21:32] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Patch-For-Review, 10Wikidata-Ministry-Of-Magic: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193#4260471 (10Marostegui) [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [] db1125...
[10:22:02] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4260475 (10Marostegui) [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [] db1125 [] dbstore1002 [] db1085 [] db1088 [] db1093 [] db1096 [] d...
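The linked wikitech page documents the full misc (m2/m3) failover procedure; a rough sketch of its core, including the "kill connections after the switch" step mentioned above. Hostnames and binlog coordinates are placeholders, and the real checklist has more steps (DNS/proxy changes, downtimes, service restarts):
```
#!/bin/bash
OLD=old-master.eqiad.wmnet   # placeholder hostnames
NEW=new-master.eqiad.wmnet

# 1. Stop writes on the old master and wait for the replica to catch up:
mysql -h "$OLD" -e "SET GLOBAL read_only = 1; SHOW MASTER STATUS;"
mysql -h "$NEW" -e "SELECT MASTER_POS_WAIT('<binlog-file>', <pos>);"

# 2. Promote the new master:
mysql -h "$NEW" -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = 0;"

# 3. Kill lingering client connections on the old master so services
#    reconnect to the new one (some may still need a restart, as noted):
mysql -h "$OLD" -e "SHOW PROCESSLIST;"   # then KILL the stale threads
```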
[10:22:09] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#4260476 (10Marostegui)
[10:22:20] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Patch-For-Review, 10Wikidata-Ministry-Of-Magic: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193#4260477 (10Marostegui)
[10:22:41] 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10Schema-change: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926#4260478 (10Marostegui)
[10:22:46] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review, 10Schema-change: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316#4260481 (10Marostegui)
[14:54:29] o/ marostegui
[14:55:31] We have these test DBs for testing MCR migration scripts; how hard would it be to get the copies of the DBs updated to the current real-time DBs and frozen again, so we have copies with the latest schema changes etc.?
[14:56:17] You mean: deleting the MCR databases, and repopulating them again with current production data?
[14:56:26] yup :)
[14:56:58] I am checking, it is only commons and eowiki
[14:57:06] I believe so
[14:57:19] It shouldn't take long I think
[14:57:20] *goes to find the related ticket*
[14:57:40] oh, this is nice
[14:57:56] because I wanted to upgrade those to 10.1
[14:58:04] :D
[14:58:06] what were those hosts again?
[14:58:12] db1111 and db1112
[14:58:17] related ticket is https://phabricator.wikimedia.org/T196172 I believe
[14:58:27] they are actually even on jessie
[14:58:35] addshore: can I remove everything there?
[14:58:47] jynus: give me 1 minute to confirm
[14:58:52] jynus: let me do a backup of the users too :)
[14:58:58] marostegui: ok
[14:59:05] addshore: confirm on the ticket, doing CC
[14:59:48] marostegui: store the grants in a file in your home on mysql-root-hosts
[15:00:06] jynus: will do
[15:00:55] marostegui: what would the time frame for the update be?
[15:00:59] a few hours? a few days?
[15:01:05] 40 minutes each
[15:01:11] okay!
[15:01:14] 45 maybe
[15:01:22] I think we can have it in a day yes
[15:01:23] actually more
[15:01:29] that is just for the upgrade
[15:01:43] if you need data reloaded, that is separate
[15:02:04] but we now have 10.1/stretch on most of production
[15:02:18] how long would the data reload take? :)
[15:02:21] only about 50 servers are still on jessie
[15:02:23] A day or so
[15:02:26] it depends
[15:02:40] tell me what your needs are
[15:02:50] as in "continue working"
[15:02:52] etc.
[15:02:57] so we can do one at a time
[15:03:10] If it is like the last time, when we imported: a day
[15:03:24] We simply need to know it won't take multiple days or weeks :)
[15:03:34] no
[15:03:46] but you are ok with one day of not having those?
[15:03:52] jynus: yes
[15:04:06] I mean most of the time is loading the data
[15:04:11] upgrading is the easy task
[15:04:45] marostegui: also read_only has to be off, I guess?
[15:05:00] to put everything in that file so we don't forget
[15:05:11] what?
[15:05:23] I got lost :)
[15:05:25] the backup of the changes for the test hosts
[15:05:31] you mentioned
[15:05:36] the users, you mean?
[15:05:45] I was asking if there was something else
[15:05:53] read_only = 0 ?
[15:06:01] something else?
[15:06:16] After the upgrade?
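One way to take the users/grants backup discussed above ("store the grants in a file in your home"). Using pt-show-grants from Percona Toolkit is an assumption here; the log does not say which tool was actually used:
```
#!/bin/bash
# Dump all accounts and their grants from both test hosts into
# per-host files, so they can be replayed after the reimage/reload.
for host in db1111.eqiad.wmnet db1112.eqiad.wmnet; do
    pt-show-grants --host "$host" > "$HOME/grants_${host%%.*}.sql"
done

# Restoring later is just replaying the file, e.g.:
#   mysql -h db1111.eqiad.wmnet < ~/grants_db1111.sql
```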
[15:06:26] yes, I intend to reimage
[15:06:41] yeah, no, just reimage (once I get the users, which I am doing now) and then leave it with me
[15:07:41] jynus: https://phabricator.wikimedia.org/T196172#4261241 I tagged DBA in the comment, but didn't bother adding it to the ticket projects
[15:07:44] unless you would like me to
[15:07:46] addshore: to finish the conversation, ask around if we can drop and reload everything, and either copy it yourself (assuming it is just code/schema)
[15:07:49] and propose a day
[15:08:10] yup, all sounds fine :)
[15:08:18] is tomorrow ok?
[15:08:22] we don't need anything that is there, and 1 day would be fine
[15:08:26] tomorrow sounds good
[15:08:28] ok
[15:09:21] I have backed up what I needed
[15:09:40] // load commons and eowiki
[15:09:53] // set as read-write
[15:10:04] only db1111
[15:10:05] those are my notes
[15:10:09] ah, ok
[15:10:29] do you set up replication?
[15:10:31] Yes
[15:10:39] I mean from production to it
[15:10:43] No no
[15:10:48] only between db1111 and db1112
[15:10:50] ok
[15:11:23] will they notice if they do incompatible alters?
[15:11:43] I guess not our problem for now :-)
[15:11:46] :)
[15:11:56] When you reimage them let me know and I will set everything back as it was
[15:13:47] ok
[15:20:22] 10DBA, 10Patch-For-Review: Decommission db1051 - https://phabricator.wikimedia.org/T195484#4261315 (10jcrespo)
[15:28:27] 10DBA: Decommission db1051 - https://phabricator.wikimedia.org/T195484#4261356 (10jcrespo)
[15:29:43] 10DBA: Decommission db1051 - https://phabricator.wikimedia.org/T195484#4228629 (10jcrespo) a:03jcrespo
[16:34:48] 10DBA, 10Patch-For-Review: Decommission db1051 - https://phabricator.wikimedia.org/T195484#4261606 (10jcrespo)
[17:22:56] 10DBA, 10MediaWiki-User-management, 10Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4261799 (10dbarratt) >>! In T193449#4261702, @dmaza wrote: > Yes and yes. We are only adding tables as part of this...
[17:26:06] 10DBA, 10MediaWiki-User-management, 10Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4261806 (10jcrespo) ipblocks should be easy to modify, but for schema changes we require a very specific workflow t...
[17:39:11] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4261829 (10jcrespo) There is a script `operations/software/dbtools/events_sanitarium.sql` that should be checked, updated and d...
[17:44:06] 10DBA, 10MediaWiki-User-management, 10Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4261842 (10TBolliger) >>! In T193449#4261799, @dbarratt wrote: >>>! In T193449#4261702, @dmaza wrote: >> Yes and ye...
[17:54:00] 10DBA, 10Data-Services: Add statistics table to information_schema_p - https://phabricator.wikimedia.org/T196570#4261913 (10jcrespo) `SHOW INDEX FROM enwiki_p.revision` will not be possible, at least not short term, because views cannot be showed indexes, and in order to do that for the underlying table, you n...
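A sketch of the post-reload setup noted above (replication only between db1111 and db1112, not from production, and db1111 set read-write). The replication user and binlog coordinates are placeholders:
```
#!/bin/bash
# Point db1112 at db1111 (placeholders for credentials/coordinates;
# a GTID-based setup would also work):
mysql -h db1112.eqiad.wmnet -e "
  STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST     = 'db1111.eqiad.wmnet',
    MASTER_USER     = 'repl',
    MASTER_PASSWORD = '<password>',
    MASTER_LOG_FILE = '<binlog-file>',
    MASTER_LOG_POS  = <pos>;
  START SLAVE;
"

# Per the notes above ("set as read-write", "only db1111"):
mysql -h db1111.eqiad.wmnet -e "SET GLOBAL read_only = 0;"
```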
[17:56:22] 10Blocked-on-schema-change, 10DBA, 10AbuseFilter: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295#4261925 (10jcrespo) p:05Triage>03Normal
[17:57:17] 10DBA, 10MediaWiki-User-management, 10Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4261933 (10Anomie) >>! In T193449#4261842, @TBolliger wrote: > Correct — we expect there to be overlapping blocks i...
[17:57:48] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4261935 (10Marostegui) I do see it is deployed on db1095 and on db1102 on the `ops` database It needs some checking, but I gues...
[18:00:14] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4261959 (10Marostegui) It is from 3 years ago...: https://gerrit.wikimedia.org/r/#/q/events_sanitarium.sql
[18:04:16] 10DBA, 10MediaWiki-User-management, 10Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4261984 (10dbarratt) >>! In T193449#4261933, @Anomie wrote: > FYI: If you want overlapping blocks, you'll not only...
[18:04:38] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4261989 (10jcrespo) > if they are really needed anymore They are needed, a different thing is how much changes they need, but...
[18:06:45] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4262001 (10jcrespo) See: T196570
[18:09:17] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4262014 (10Marostegui) I definitely think we do not need the `ops` database one on sanitarium hosts, those are probably entries...
[18:16:52] 10DBA, 10MediaWiki-User-management, 10Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4262032 (10kaldari) >I have to agree with @dbarratt that Option B is the more solid implementation and like you sai...
[18:19:02] 10DBA, 10MediaWiki-User-management, 10Anti-Harassment (AHT Sprint 21/22): Draft a proposal for granular blocks table schema(s), submit for DBA review - https://phabricator.wikimedia.org/T193449#4262049 (10dbarratt) >>! In T193449#4262032, @kaldari wrote: > Except that they aren't actually that different. Cur...
[18:20:16] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4262072 (10Marostegui) Nevermind my comments above. They have nothing to do with the sanitarium events. The ones on the file a...
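To illustrate jcrespo's point in T196570 above: the wiki-replica `_p` databases expose views, and SHOW INDEX does not return index information for a view, only for a base table, so index metadata would have to come from `information_schema.statistics` on the underlying table. Names below follow the ticket; this is a sketch, not the proposed information_schema_p implementation:
```
#!/bin/bash
# Does not give index information: enwiki_p.revision is a view.
mysql -e "SHOW INDEX FROM enwiki_p.revision;"

# The underlying table's index metadata lives in information_schema,
# which is what a curated statistics table/view could be built from:
mysql -e "
  SELECT index_name, seq_in_index, column_name, cardinality
  FROM information_schema.statistics
  WHERE table_schema = 'enwiki' AND table_name = 'revision'
  ORDER BY index_name, seq_in_index;
"
```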