[02:02:06] DBA, Data-Services, Epic: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3692624 (bd808)
[02:09:38] DBA, Data-Services, Tracking: Wikireplica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3692628 (bd808)
[02:09:41] DBA, Cloud-Services, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3692629 (bd808)
[02:09:45] DBA, Data-Services, Epic: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3692625 (bd808) Open>Resolved a:jcrespo The main description here and https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database/Replica_drift have been updated to reflect our confidence tha...
[02:17:18] DBA, Data-Services, Epic: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3692637 (bd808)
[02:22:12] DBA, Cloud-Services, Operations, Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3692639 (bd808)
[02:25:46] DBA, Data-Services, Tracking: Wikireplica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3692641 (bd808)
[02:29:20] DBA, Data-Services, Tracking: Wikireplica service for tools and labs - issues and missing available views (tracking) - https://phabricator.wikimedia.org/T150767#3692646 (bd808) Open>declined a:jcrespo Tracking task deprecated. Please file tasks tagged as #data-services instead.
[02:34:45] DBA, Community-Tech, Data-Services, Security, cloud-services-team (Kanban): Create core ip_changes view for replicas - https://phabricator.wikimedia.org/T173891#3692653 (bd808) Open>Resolved a:Andrew @andrew merged my config patch and ran the script to create the views. ``` $ sql...
[02:37:31] DBA, Commons, Contributors-Team, MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3692657 (Vicpeters) Thanks for help! It works now!
[02:52:54] DBA, Data-Services, User-bd808, cloud-services-team (Kanban): Determine schema differences between labsdb1001 and labsdb1009 - https://phabricator.wikimedia.org/T177223#3692662 (bd808) >>! In T177223#3651125, @jcrespo wrote: > There is already 5 related things that, even nothing to do with this,...
[02:54:22] DBA, Data-Services, User-bd808, cloud-services-team (Kanban): Determine schema differences between labsdb1001 and labsdb1009 - https://phabricator.wikimedia.org/T177223#3692664 (bd808) From {rOPUP2b2050646943e3f7dd03370692b90aba6fdda669} we also now have the `modules/role/files/labs/db/views/extr...
[03:00:12] DBA, Data-Services, User-bd808, cloud-services-team (Kanban): Determine schema differences between labsdb1001 and labsdb1009 - https://phabricator.wikimedia.org/T177223#3692668 (bd808) Deeper inspection by me is currently blocked by {T178128}, but my intent is to run a tool like [[https://dev.mys...
[03:00:31] DBA, Data-Services, User-bd808, cloud-services-team (Kanban): Determine schema differences between labsdb1001 and labsdb1009 - https://phabricator.wikimedia.org/T177223#3692670 (bd808) Open>stalled
[03:00:34] DBA, Data-Services, Patch-For-Review: Some queries to new replica hosts are dramatically slower than labsdb; missing indexes? - https://phabricator.wikimedia.org/T177096#3692671 (bd808)
[05:19:45] Blocked-on-schema-change, DBA, Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3692769 (Marostegui)
[05:20:22] Blocked-on-schema-change, DBA, Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3672104 (Marostegui)
[05:20:58] DBA, Operations, ops-eqiad: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3692787 (Marostegui) The ALTER tables finished correctly and no more crashes happened. Let's change the memory anyways @Cmjohnson
[05:22:15] Blocked-on-schema-change, DBA, Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3692792 (Marostegui)
[05:29:46] Blocked-on-schema-change, DBA, Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3692796 (Marostegui)
[06:43:47] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3689827 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db2084.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2017101...
[07:01:27] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3692858 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2084.codfw.wmnet'] ``` and were **ALL** successful.
[07:40:57] Blocked-on-schema-change, DBA, Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3692900 (Marostegui)
[09:25:36] jynus: marostegui: Heads up: cawiki has statement usage again (since about an hour)
[09:25:39] and it looks fine so far
[09:25:45] I'll monitor throughout the day
[09:25:58] remember: Fine to revert that change if it causes any trouble
[09:26:00] thanks
[09:26:06] cool, thanks for the heads up
[09:26:28] I'm off for lunch, but will keep an eye on the graphs throughout the day
[09:38:51] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=11&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1010&var-port=9104&from=1507714724221&to=1508319524221
[09:39:13] purge history now going down
[09:42:12] Blocked-on-schema-change, DBA, Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3693088 (Marostegui)
[09:47:53] DBA, Operations: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693095 (Marostegui)
[09:49:00] what about Revert "db-eqiad.php: Depool db1098" ?
[09:49:03] DBA, Operations: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693109 (Marostegui) Not the first time this happens to this same host: T158188 T145533
[09:49:06] ok to merge?
[09:49:22] marostegui: ^
[09:49:37] or should I cherry pick my patch
[09:50:59] reverting depool db1098 is fine
[09:51:01] you can merge
[09:52:27] DBA, Operations: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693117 (Marostegui) The RAID looks good: ``` root@db1082:~# hpssacli controller all show config Smart Array P840 in Slot 1 (sn: PDNNF0ARH1910I) Port Name: 1I Port Name: 2I Internal Drive Cage at Por...
[09:54:54] DBA, Operations, ops-eqdfw: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693095 (Marostegui) p:Triage>High This is the second time this server has a storage crash: T158188 @Cmjohnson can we get a new RAID controller for this host? It has happened twice already.
[09:55:16] DBA, Operations, ops-eqiad: db1082 crashed - https://phabricator.wikimedia.org/T178460#3693131 (Marostegui)
[10:01:39] DBA, Operations, ops-eqiad: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3693168 (Marostegui)
[10:19:05] DBA, Community-Tech, Data-Services, Security, cloud-services-team (Kanban): Create core ip_changes view for replicas - https://phabricator.wikimedia.org/T173891#3693202 (IKhitron) [[https://www.mediawiki.org/wiki/Manual:Ip_changes_table|Documentation]].
[10:24:57] DBA, Operations, Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3693217 (jcrespo) To try to solve the previous issue, the following grants have been executed: ``` root@dbstore2001> set sql_log_bin=0; GRANT SELECT,...
[10:41:58] hello people
[10:42:27] I created https://gerrit.wikimedia.org/r/#/c/384963/1 to reimage db110[78] to stretch and set notifications disabled
[10:42:34] marostegui: can we pool temporarily
[10:42:42] other s5 host as api
[10:42:50] db1071 may be overloaded
[10:44:10] elukey: one thing is missing, that you don't know (it is a DBA secret)
[10:45:22] unlock them for reimage
[10:45:48] oh this is interesting :D
[10:46:08] we set this up after an es server was reimaged accidentally
[10:46:09] jynus: sure we have plenty on s5
[10:46:20] we can use db1105 for instance
[10:46:26] cool
[10:46:30] you want me to do it?
[10:46:34] I will pool it as api
[10:46:36] don't worry
[10:46:37] cool
[10:46:38] thanks
[10:47:53] elukey: puppet/modules/install_server/files/autoinstall/netboot.cfg
[10:50:45] * elukey tries to figure out how to unlock
[10:51:02] apparently, cmjohnson did it already
[10:51:31] git diff e9a931db704 e9a931db704~1
[10:51:38] so no need to
[10:51:54] so basically the trick is not to have any partman config ?
[10:52:06] basically, for the reimage to happen successfully, it must have db.cfg
[10:52:16] okok got it :)
[10:52:17] no, it is not there
[10:52:27] there is a rule below
[10:52:32] that doesn't format the server
[10:53:02] the catch all not srv format
[10:53:13] we should add them to db-no-srv-format.cfg after reimage
[10:53:21] sure
[10:53:24] and a few days of seeing we do not have to do it again
[10:53:59] you did not give them any role
[10:53:59] another (probably super trivial) question: can I safely apply ::role::mariadb::misc::eventlogging to both hosts since they are not in any dbproxy config? Or should I take extra precautions?
[10:54:05] yes :)
[10:54:07] which is ok, we can do that later
[10:54:15] but I said the notifications disabled
[10:54:32] because I thought we were going to puppetize them from the beginning
[10:55:03] I do not think the default host pages for anything
[10:55:20] I can surely add puppet config in site.pp now, just wanted to make sure that they wouldn't cause any issue to db104[67]
[10:55:26] re ::role::mariadb::misc::eventlogging
[10:55:44] as a rule- we do not puppetize anything related to state
[10:55:59] (that is why we have to provision them from the beginning)
[10:56:08] however, puppet may fail to connect to mysql
[10:56:18] not a big deal
[10:56:30] if you want to setup mysql data first, that is ok
[10:56:50] although, we should probably not have 2 masters at the same time
[10:57:05] so we setup the replica, check that it is working from the old master
[10:57:09] then we migrate the master?
[10:57:16] sounds ok?
[10:57:44] (stopping event logging updates, etc.)
[10:58:05] makes sense!
[10:58:15] basically ::role::mariadb::misc::eventlogging
[10:58:22] custom_repl_slave is a bit more tricky
[10:58:48] this would be a good opportunity
[10:58:52] to clean up puppet
[10:59:01] so copy & paste instead of migrate
[10:59:08] to proper profiles
[10:59:13] and then delete the old ones
[10:59:15] but up to you
[10:59:25] +1, I can try to do it and send a code review
[10:59:26] there are many things I would like to change for 10.1
[10:59:32] and stretch
[10:59:41] that were wrong but the only thing possible for ubuntu
[10:59:53] socket location, systemd integration, etc.
[11:00:14] we have them already implemented everywhere else, so it is not a new thing
[11:00:31] just migrate old analytics-m4 to the new model
[11:00:42] makes sense
[11:01:06] we should also give a look at config
[11:01:10] (mysql)
[11:01:18] so I can do this: reimage the new hosts without any puppet role, in the meantime work on the puppet config with your supervision
[11:01:24] but that can come later, that is why I said you can do that part on your own
[11:01:30] yeah
[11:01:46] super, so I'll start the work after lunch and then explore the puppet config
[11:01:48] the only thing here is to remember to retire them from the reimage
[11:01:55] got it
[11:02:06] so they cannot be accidentally reimaged
[11:02:20] the other thing is that db1108 will not replicate wiki databases but only eventlogging
[11:02:32] yeah, that is actually the easy part
[11:02:47] and if in the future we have to, we can
[11:03:03] those servers are larger than db1047
[11:03:15] but not as large as to have all that dbstore1002 has
[11:03:25] looking forward to run eventlogging cleaner on db1108 :D
[11:03:41] yeah, it will likely be 6x faster
[11:04:08] to be fair, we may encounter many puppet issues
[11:04:10] small ones
[11:04:24] but those hosts predate 100% puppet coverage
[11:05:13] I think we should have something like role::mariadb::misc::eventlogging::master and replica
[11:05:21] and the profile the mysql parts
[11:05:34] and the eventlogging custom replication parts separately
[11:05:38] *then
[11:05:44] makes sense, this is a good occasion to clean up puppet
[11:06:04] one profile because it is a database
[11:06:14] and another profile because eventlogging
[11:06:53] then there is the typical discussion if we should have role::mariadb::labsdb or role::labs::mariadb :-)
[11:07:51] ahhaha
[11:07:59] all right going to lunch and then I'll start the work :)
[11:08:02] thanks!
[11:13:52] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693514 (Marostegui) I was thinking on a draft of how to set up the multi-instance hosts, and how to combine the shards within the hosts. What about this: A host will handle two shards, so we could ha...
[11:22:20] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693523 (jcrespo) Looks ok to me, good combination. s2 and some other in the past suffered DOS due to recentchanges queries, but none since the query killer was setup, and it shouldn't affect other ins...
[11:23:55] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693524 (Marostegui) >>! In T178359#3693523, @jcrespo wrote: > Looks ok to me, good combination. s2 and some other in the past suffered DOS due to recentchanges queries, but none since the query killer...
[11:25:49] Blocked-on-schema-change, DBA, Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3693525 (Marostegui)
[11:44:42] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693560 (jcrespo) Actually, this may be overthinking, but could an availability issue be having the same pair all the time. Should we think about having s1 <-> s3, s1 <-> s2, s3 <-> s4, s4 <->s2 or wou...
[11:49:58] the old dumps are working, at least this time no issues with stop slave
[12:03:21] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693587 (Marostegui) >>! In T178359#3693560, @jcrespo wrote: > Actually, this may be overthinking, but could an availability issue be having the same pair all the time. Should we think about having s1...
[12:10:27] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693599 (jcrespo) Maybe, I would switch: 4 hosts doing: s1 s3 s2 s4 s5 s7 s6 s8 And 4 hosts doing: s1 s2 s7 s6 s5 s4 s3 s8 So s1, s4, s8, s7 are never together (there are multiple combination...
[12:11:33] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693600 (Marostegui) Sounds good to me and with that combination I do not have to undo anything ;-)
[12:13:19] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693605 (Marostegui)
[12:17:51] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693611 (Marostegui) It would be a bit more painful to provision a new host, if needed, as it would not be like copying just from a single host, but from two. But apart from that...
[12:46:32] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693655 (Marostegui) A draft of the distribution for codfw: db2092 (B8): s1,s3 db2091 (A8): s2, s4 db2086 (B1): s5,s7 db2089 (A3): s6,s8 db2087 (B1): s6,s7 db2088 (D1): s1,s2 db2084 (D6): s4,s5 db20...
[13:35:05] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693781 (jcrespo) Note: ``` db2086 (B1): s5, s7 db2087 (B1): s6, s7 ```
[13:35:53] question about netboot: is db[0-8][0-9] a catch all for all the hostnames starting with db and two digits?
[13:36:05] (to get db-no-srv-format.cfg)
[13:38:54] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693797 (Marostegui) >>! In T178359#3693781, @jcrespo wrote: > Note: > > ``` > db2086 (B1): s5, s7 > db2087 (B1): s6, s7 > ``` Fixed, it was a typo - db2087 is on C1
[13:40:11] DBA, Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3693815 (jcrespo)
[13:41:03] DBA, Analytics, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3693836 (elukey) Updating this task in light of the recent discussions. The analytics and DBA teams have been fighting a lot with disk space consumption on dbstore1002 due t...
[14:32:11] DBA, Operations, ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3694025 (Cmjohnson) The disk has been replaced
[14:39:36] DBA, Operations, ops-eqiad: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3694030 (Cmjohnson) A new DIMM has been requested with Dell. You have successfully submitted request SR955416674.
[14:48:16] DBA, Operations, ops-eqiad, Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3694047 (Cmjohnson) A case with HPE has been submitted Your case was successfully submitted. Please note your Case ID: 5323881381 for future reference.
[14:50:29] DBA, Operations, ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3652772 (jcrespo) ``` physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, Rebuilding) ```
[15:14:42] DBA, Operations, ops-eqiad: db1101 crashed - memory errors - https://phabricator.wikimedia.org/T178383#3694184 (Marostegui) Great!! Thanks!
[15:15:24] DBA, Operations, ops-eqiad, Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3694186 (Marostegui) Thank you!
[15:41:14] DBA, Operations, Patch-For-Review: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3694264 (jcrespo) Latest run does flow and others too, correctly: ``` root@dbstore2001:/srv/backups/x1.20171017220041$ ls -la | grep flowdb -rw-r--r--...
[15:53:29] DBA, Operations, ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3694315 (jcrespo) Open>Resolved a:jcrespo ``` physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, OK) RECOVERY - HP RAID on db1092 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:...
[16:50:53] DBA, MediaWiki-extensions-WikibaseClient, Wikidata, Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3694450 (hoo)
[17:30:20] DBA, MediaWiki-extensions-WikibaseClient, Wikidata: Enable description usage tracking on further wikis - https://phabricator.wikimedia.org/T178515#3694560 (hoo)
[17:51:20] DBA, MediaWiki-extensions-WikibaseClient, Wikidata, User-Daniel, Wikidata-Sprint: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3694646 (hoo) We're already at 6,213,948 statement usages on `cawiki` (this is a lot). The number of usages is st...
[23:04:54] DBA, Wikimedia-General-or-Unknown, Performance: Clean up skin properties - https://phabricator.wikimedia.org/T171643#3695440 (Jdlrobson)