[05:22:39] 10DBA, 10Wikimedia-Site-requests, 07Tracking: Database table cleanup (tracking) - https://phabricator.wikimedia.org/T18660#3078834 (10Krinkle) [07:05:46] 10DBA, 06Operations, 10ops-eqiad: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079045 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1060.eqiad.wmnet'] ``` The l... [07:30:35] 10DBA, 06Operations, 10ops-eqiad: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079091 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1060.eqiad.wmnet'] ``` and were **ALL** successful. [07:47:53] 10DBA, 10Wikidata, 07Schema-change, 03Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3079110 (10Marostegui) >>! In T159718#3077513, @jcrespo wrote: > 2. Populate new column a... [07:54:38] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3079114 (10Marostegui) So, I have deployed the patch to enable gtid_domain_id on all the core hosts. It will be picked up upon restart, but I am manually enabling it on all the shards though, to... [07:56:03] 10DBA, 06Operations, 10ops-eqiad: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079117 (10Marostegui) The data transfer between db1067 and db1060 was started around 20 minutes ago. [08:04:10] 10DBA, 10Wikidata, 07Schema-change, 03Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3079126 (10WMDE-leszek) Thank you @jcrespo and @Marostegui for your comments. 
I am happy I... [08:08:00] 10DBA, 10Wikidata, 07Schema-change, 03Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3079129 (10Marostegui) >>! In T159718#3079126, @WMDE-leszek wrote: > Thank you @jcrespo an... [08:08:45] 10DBA, 10Wikidata, 07Schema-change, 03Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3079130 (10WMDE-leszek) [08:12:53] 10DBA, 10Wikidata, 07Schema-change, 03Wikidata-Sprint: Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3079132 (10WMDE-leszek) [08:32:55] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3079176 (10Marostegui) s5 done dbstore (1001,1002 - 2001 and 2002 had it enabled for a long time already) done [08:52:23] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3079268 (10Marostegui) s3 is done [09:17:14] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3079324 (10Marostegui) tendril host is done (not that it really needs it, but for consistency) s7 is done [10:16:25] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3079458 (10Marostegui) x1 is now done. old labsdb1001 and 1003 also done (for consistency). 
[10:24:17] 10DBA, 10Wikidata, 07Schema-change, 03Wikidata-Sprint: Evaluate how to best add a column for full entity ID to wb_terms without affecting wikidata.org users - https://phabricator.wikimedia.org/T159718#3079459 (10WMDE-leszek) [11:20:28] 10DBA, 06Operations, 10ops-eqiad: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079570 (10Marostegui) db1060 has been reimaged and recloned and it is now trying to catch up (GTID is enabled) [11:23:55] there is still one disk with 5 errors [11:26:34] on db1060? [11:27:52] 32:1 has 2 errors [11:28:04] mmm [11:28:15] it was 5 last time I checked, I am not kidding [11:28:27] the original ones that had issues (when the ticket was created) were #4 and #7 [11:28:30] so maybe this is new [11:28:42] note one has not been replaced [11:29:14] I see 4 on #4 [11:29:19] 5 on #5 [11:29:26] "other errors" [11:29:31] 5 on #4 [11:29:37] ah, I was grepping for Media error [11:29:45] and yes [11:29:53] now 2 media on #1 [11:37:20] 10DBA, 06Operations, 10ops-eqiad: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079632 (10Marostegui) For the record, we are seeing the following disk errors (raid is fine and disks are online though): ``` #1 Media error count: 2... [12:25:16] 10DBA, 13Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3079755 (10Marostegui) db2053: ``` root@neodymium:/home/marostegui/git/software/dbtools# for i in frwiki jawiki ruwiki; do echo $i;mysql --skip-ssl -hdb... [12:31:13] 10DBA, 13Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3079759 (10Marostegui) db2039, which is the rc slave looks good compared to db1026 so no need to touch it: ``` PRIMARY KEY (`rev_id`,`rev_user`), K...
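Side note on the db1060 disk error counts discussed between 11:23 and 11:37: the two participants briefly disagreed because one was grepping only for media errors while the other was reading the "other errors" counter. A minimal sketch of collecting both counters per slot in one pass could look like the following. The `SAMPLE` text is hypothetical and only assumes a MegaCli-style listing with `Slot Number`, `Media Error Count` and `Other Error Count` fields; it is not the real db1060 output, and `error_counts` is an illustrative helper, not an existing tool.

```python
import re

# Hypothetical sample in the style of a MegaCli physical-drive listing.
# The field names are an assumption; the real db1060 output is not shown here.
SAMPLE = """\
Enclosure Device ID: 32
Slot Number: 1
Media Error Count: 2
Other Error Count: 0
Enclosure Device ID: 32
Slot Number: 2
Media Error Count: 0
Other Error Count: 0
Enclosure Device ID: 32
Slot Number: 4
Media Error Count: 0
Other Error Count: 5
"""

FIELD = re.compile(
    r"^(Slot Number|Media Error Count|Other Error Count):\s*(\d+)\s*$"
)


def error_counts(text):
    """Return {slot: {"media": n, "other": n}} for slots with any error."""
    result = {}
    slot = None
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if not match:
            continue  # ignore fields we do not care about
        name, value = match.group(1), int(match.group(2))
        if name == "Slot Number":
            # A new slot starts; counters default to zero until seen.
            slot = value
            result[slot] = {"media": 0, "other": 0}
        elif slot is not None:
            key = "media" if name.startswith("Media") else "other"
            result[slot][key] = value
    # Keep only the slots where at least one counter is non-zero.
    return {s: c for s, c in result.items() if c["media"] or c["other"]}


if __name__ == "__main__":
    for slot, counts in sorted(error_counts(SAMPLE).items()):
        print(f"slot {slot}: media={counts['media']} other={counts['other']}")
```

Reporting both counters side by side avoids the situation above, where a filter on one counter hides non-zero values in the other.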
[12:50:53] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3079806 (10Marostegui) Deployed on s1. All the hosts in the .hosts files have this variable now deployed and enabled. [12:51:00] 10DBA: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#3079808 (10Marostegui) [12:51:04] 10DBA, 13Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#3079809 (10Marostegui) [12:51:06] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#3079807 (10Marostegui) 05Open>03Resolved [12:52:51] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2751745 (10Marostegui) Note: Having this deployed and enabled doesn't mean we can enable GTID yet on multisource slaves (or at least in a non disruptive way, see: T149418#3070309 and T149418#30... [13:14:53] 10DBA, 10Wikidata, 07Performance, 15User-Daniel, 15User-Ladsgroup: Use redis-based lock manager in dispatch changes in production - https://phabricator.wikimedia.org/T159826#3079907 (10Ladsgroup) [13:15:24] 10DBA, 10Wikidata, 07Performance, 15User-Daniel, and 2 others: Use redis-based lock manager in dispatch changes in beta cluster - https://phabricator.wikimedia.org/T159828#3079937 (10Ladsgroup) [13:15:43] 10DBA, 10Wikidata, 07Performance, 15User-Daniel, and 2 others: Use redis-based lock manager in dispatch changes in production - https://phabricator.wikimedia.org/T159826#3079907 (10Ladsgroup) [13:51:24] 10DBA, 10Wikidata, 13Patch-For-Review, 07Performance, and 3 others: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#3080083 (10Ladsgroup) For implementation details. 
Let's keep talking in {T159828} and {T159826} [13:53:00] 10DBA, 10Wikidata, 07Performance, 15User-Daniel: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#3080087 (10Ladsgroup) [13:53:05] 10DBA, 10Wikidata, 13Patch-For-Review, 07Performance, and 3 others: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#3080086 (10Ladsgroup) 05Open>03Resolved [14:09:48] jynus, marostegui: FYI, I'm upgrading systemd on jessie database servers to the latest bugfix release from jessie 8.7. several db servers which have been installed in the recent weeks (like db1001 or db1022) already have that new version, so should be fully harmless [14:10:15] moritzm: sounds good! (and I guess db1060 which was reimaged today also has it) [14:10:24] ok to me [14:10:43] if there is an issue, it should be on dbproxies, not on db* [14:11:22] marostegui: yep, also e.g. db1095 [14:11:39] jynus: ok, I'll upgrade one dbproxy host ahead and keep an eye on it for a bit [14:12:01] plenty of other jessie hosts have the new version without problems, so don't expect any trouble [14:12:04] dbproxy1001 is depooled, you can start with that [14:12:51] ok, dbproxy1001 done [14:16:03] marostegui, when you have some time, I would like to deploy gerrit:340487 [14:16:12] I need coordination in case that breaks something [14:17:15] jynus: can we do it at 3.30? [14:17:25] whenever you can [14:17:31] ok, let's do it at 3:30 [14:17:33] I just need no db-related deployments [14:17:51] nor ongoing reboots, etc. 
[14:17:53] ah, I don't plan to deploy anything soon [14:18:45] that only writes to my.cnf- I want to avoid a reboot in case it goes wrong [14:21:57] feel free to go ahead now then [14:30:09] ok [14:33:36] we can run puppet and restart some servers [14:33:54] as long as they are not essential [14:34:07] I will try with labsdb1004 [14:35:24] ok [14:35:50] i can try db1067 (depooled) [14:35:52] the good news is that we will have start for tools [14:36:04] *stats (prometheus) [14:36:21] and I can add s1 for sanitarium, better something than all of them [14:36:30] *than nothing [14:37:21] it also removes the older certs [14:37:31] that were no longer in use (ssl) [14:39:13] \o/ https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=labs&var-shard=tools&var-role=All&from=now-5m&to=now [14:39:55] no puppet failures, which is almost surprising [14:48:50] db1067 looked good too [14:51:09] mm [14:51:30] there is a socket on the location for labsdb1004 [14:51:41] but mysql is not running [14:51:54] I do not know how that can be [14:52:45] should I delete it? [14:52:54] lsof says no one is using it [14:52:57] I would say so because: 170307 14:48:56 mysqld_safe mysqld from pid file /srv/labsdb/data/labsdb1004.pid ended [14:53:02] so it is already looking for the new location [14:53:18] how can that have happened? [14:53:26] the only reason I can think of is [14:53:37] on crash? [14:53:43] my.cnf changes the location, the daemon stops and looks for the NEW location to delete the pid [14:53:53] interesting [14:54:01] it didn't crash, so that is discarded [14:54:12] it did, didn't it? [14:54:22] nope, no crash [14:54:23] I think even it was you who reported it [14:54:25] ? [14:54:29] and the uptime was low [14:54:52] oh, it crashed but 2 days ago [14:54:56] yes [14:54:58] was the my.cnf changed already? 
[14:55:03] oh, then that is it for sure [14:55:09] the socket never changed [14:55:14] but if it was that [14:55:18] it would have not started again? [14:55:42] let's try again [14:56:04] oh [14:56:11] I think I know what is happening [14:56:39] some permission problem [14:56:46] ah! [14:57:03] 170307 14:55:38 [ERROR] Can't start server : Bind on unix socket: Permission denied [14:57:06] 170307 14:55:38 [ERROR] Do you already have another mysqld server running on socket: /var/run/mysqld/mysqld.sock ? [14:57:09] 170307 14:55:38 [ERROR] Aborting [14:57:54] yeah [14:58:11] I am not sure if to puppetize this [14:58:29] because the socket path can change [14:58:43] but we do not want puppet to arbitrarily change a dir owner [14:59:02] maybe I can puppetize /var/run/mysqld hardcoded [14:59:20] that would work [14:59:21] and if someone sends the socket in other place, it is their problem to do it [14:59:32] for the dir [14:59:48] seem a good compromise between flexibility and security? [15:00:01] I think so [15:00:08] the pid should be there anyways [15:00:35] yes [15:00:50] we just have to painfully move it on every restart :-) [15:01:03] which is the whole point of this commit [15:01:21] at least now we have the option [15:01:29] we can now add the yaml mapping [15:01:39] which should be easy to edit [15:02:23] ssl, the other thing that I changed, still works [15:02:32] so this was the complicated part of the deploy [15:02:59] I will do now the followups [15:03:04] you want to try to restart db1067? [15:03:07] it is depooled anyways [15:03:09] oh [15:03:36] so we can test on a core host? [15:03:43] sure, if that is possible [15:03:50] sure [15:03:51] let me do it [15:03:54] going to silence it [15:04:02] it still complains about "s51412__data"."book" [15:04:50] I may drop the 3 ignored dbs [15:05:05] and reimport them?
[15:05:09] no [15:05:25] makes no sense if they are not up to date [15:05:44] what we can do is maybe create dumps from time to time [15:05:49] locally [15:06:54] db1067 deleted the socket with no issue [15:06:57] let's start it now [15:07:38] yeah, it is why tmp is so easy to handle [15:07:50] but also insecure [15:09:28] no issues with db1067 SSL [15:09:36] great [15:10:00] you are such a great working colleague, manuel! [15:10:21] are you joking now or not?! [15:10:29] I am so happy you are here to help me [15:10:34] I am being serious [15:10:41] I feel I slow you down! [15:10:48] wat? [15:10:49] no [15:11:08] you have been taking care of almost all incidents lately on your own [15:11:15] faster than me [15:12:00] maybe my phone operator sends the sms faster to me :) [15:12:07] ha ha [15:12:07] no, but I am trying to offload you as much as I can [15:12:46] so thoughts on db1051? [15:12:54] we said wednesday, right? [15:12:56] is that blocking you? [15:13:00] nope [15:13:01] no worries [15:13:05] I do not want to block you [15:13:15] I mean, I could do the alter, but 3 days isn't going to change anything [15:13:19] so I rather wait [15:13:33] but it may not finish... [15:13:47] In that case we need to also know that we cannot do analyze table on revision [15:13:50] which is worrying [15:13:55] yeah [15:14:04] it was started sunday, right? [15:14:31] i wish there was some strace for mysql processes [15:14:37] within mysql prompt [15:14:47] replication is stopped no? [15:14:48] strace?
[15:15:07] well, we could do that in some ways [15:15:29] yes, replication is stopped [15:15:54] iostat 1 100 [15:15:56] we could check the partition activity and see which one is being read [15:16:05] haha same thinking :) [15:16:17] it is not doing THAT much if you see that [15:23:04] 10DBA, 10Wikidata, 07Performance, 15User-Daniel, and 2 others: Build an environment to test change dispatching using Redis-based locking - https://phabricator.wikimedia.org/T155190#3080277 (10Ladsgroup) the instances are redis-dispatching-repo.wmflabs.org and redis-dispatching-client.wmflabs.org up and run... [15:23:23] 10DBA, 10Wikidata, 13Patch-For-Review, 07Performance, and 3 others: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#3080280 (10Ladsgroup) [15:23:26] 10DBA, 10Wikidata, 07Performance, 15User-Daniel, and 2 others: Build an environment to test change dispatching using Redis-based locking - https://phabricator.wikimedia.org/T155190#3080279 (10Ladsgroup) 05Open>03Resolved [15:37:58] https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=19&fullscreen&var-server=db1051&var-network=eth0&from=now-7d&to=now (?) [15:38:32] interesting graph [15:38:57] maybe it finished scanning up the table? (?) 
[15:39:11] or it bail out [15:39:16] *bailed [15:41:45] wow, show engine innodb status takes AGES to run [15:42:00] yeah [15:42:07] also the monitoring takes 10 seconds [15:42:09] 1 row in set (27.29 sec) [15:42:31] but I do not see swap, like last time [15:44:20] i think it is doing stuff [15:44:23] look at this [15:44:45] "doing stuff" :-) [15:45:28] https://phabricator.wikimedia.org/P5021 [15:45:34] look at the "rows locks" [16:37:19] https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1069 [17:06:40] 10DBA, 06Operations, 10ops-codfw: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3080713 (10jcrespo) [17:07:02] 10DBA, 06Operations, 10ops-codfw: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3074469 (10jcrespo) It finally failed, see T159849 summary. [17:07:21] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3080720 (10jcrespo) [17:07:55] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3074469 (10jcrespo) ``` physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Failed) ``` [17:10:18] 10DBA, 10Wikidata: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#1709272 (10WMDE-leszek) [17:16:53] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3080832 (10Papaul) a:05Papaul>03Marostegui disk replacement complete [17:17:36] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3080836 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [17:20:40] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 15User-Daniel: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3080842 (10Hall1467) [17:21:59] 10DBA, 06Operations, 10ops-codfw: Degraded RAID 
on db2044 - https://phabricator.wikimedia.org/T159665#3080845 (10Marostegui) Thanks! Disk is rebuilding! ``` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F5EF0) Port Name: 1I Port Name: 2I... [17:22:52] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3080848 (10Marostegui) Thanks - raid getting rebuilt ``` root@db2048:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E3350) Gen8 ServBP 12+2 at Port 1I,... [18:17:03] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081047 (10RobH) [18:23:58] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081119 (10RobH) Switch ports disabled, diff below since the port info will be needed once these systems are unracked. [edit interfaces ge-6/0/0] - enable; + disable; [edit interfaces... [18:25:38] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081145 (10RobH) [18:35:08] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 15User-Daniel: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3081208 (10Hall1467) @jcrespo: Asking my question on here per your request. We were wondering what your thoughts are for the proposed l... [19:09:04] 10DBA, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081365 (10RobH)
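Side note on the degraded RAID tickets above (T159665, T159666): the failed disk was identified from a `physicaldrive` line in the controller listing, as quoted at 17:07:55. A minimal sketch of pulling non-healthy drives out of such a listing might look like this. The first sample line is copied from the log; the second line and the assumption that healthy drives report `OK` are hypothetical, and `unhealthy_drives` is an illustrative helper, not part of any existing tool.

```python
import re

# First line is quoted from T159666 in this log; the second line is a
# hypothetical healthy drive, assuming the controller reports "OK" for it.
SAMPLE = """\
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Failed)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
"""

PHYSICAL_DRIVE = re.compile(
    r"physicaldrive\s+(?P<id>\S+)\s+"
    r"\(port (?P<port>[^:]+):box (?P<box>\d+):bay (?P<bay>\d+),\s*"
    r"(?P<iface>[^,]+),\s*(?P<size>[^,]+),\s*(?P<status>[^)]+)\)"
)


def unhealthy_drives(text, healthy="OK"):
    """Return (drive id, status) for every drive whose status is not healthy."""
    drives = []
    for match in PHYSICAL_DRIVE.finditer(text):
        status = match.group("status").strip()
        if status != healthy:
            drives.append((match.group("id"), status))
    return drives


if __name__ == "__main__":
    for drive_id, status in unhealthy_drives(SAMPLE):
        print(f"{drive_id}: {status}")
```

Matching on anything other than the healthy status, rather than on `Failed` alone, means a drive in an intermediate state (for example while the array is rebuilding) would also be listed.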