[07:22:45] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968039 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1072.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage...
[07:31:06] 10DBA, 06Operations, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2980985 (10Marostegui) I have upgraded db2012 to 10.0.29-2 (actually done a full upgrade) and the A...
[07:46:28] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2980992 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1072.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['db1072.eqiad.wmnet']) ```
[07:50:43] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2980995 (10Marostegui) >>! In T156226#2980992, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ``` > ['db1072.eqiad.wmnet'] > ``` > > Of which those **FAILED**: > ``` > set(['db1072....
[07:55:08] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2980998 (10Marostegui) And from the reimage command output: ``` sudo -E wmf-auto-reimage -p T156226 db1072.eqiad.wmnet START To monitor the full log: tail -F /var/log/wmf-auto-reimage/201701300722_m...
[08:03:35] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968039 (10MoritzMuehlenhoff) I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key and difficult to fix in the current design of w...
[08:13:28] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2981031 (10Marostegui) >>! In T156226#2981023, @MoritzMuehlenhoff wrote: > I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key an...
[14:07:41] let's see how 10.0.29 behaves in db1072 and if it catches up :-) https://grafana-admin.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1072&from=now-24h&to=now
[15:01:13] 10DBA, 06Operations, 10ops-codfw: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2981946 (10Marostegui)
[15:08:31] there is a high number of dbquery errors since 8 today
[15:09:04] most on wikidatawiki
[15:09:16] wow
[15:09:17] indeed
[15:09:56] jobs
[15:10:08] started at 7:35 or so
[15:10:15] recentChangesUpdate
[15:10:37] DatabaseMysqlBase::lock failed to acquire lock 'wikidatawiki-activeusers'
[15:11:27] was there any deploy or something?
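(For context on the lock error above: MediaWiki's DatabaseMysqlBase::lock() is built on MySQL/MariaDB named advisory locks, so "failed to acquire lock" means a GET_LOCK() call timed out while another connection was holding the same name. The sketch below only illustrates that mechanism; the host is a placeholder, credentials are assumed to come from ~/.my.cnf, and the exact lock string MediaWiki passes may be prefixed or hashed rather than the literal name from the log.)

```
#!/bin/bash
# Placeholder host; connection credentials are assumed to live in ~/.my.cnf.
DB_HOST="db-host.example"

# Connection A: acquire the named lock (GET_LOCK returns 1 on success) and keep
# the session, and therefore the lock, alive for 30 seconds in the background.
mysql -h "$DB_HOST" -e "SELECT GET_LOCK('wikidatawiki-activeusers', 5); SELECT SLEEP(30);" &

sleep 2  # give connection A time to take the lock

# Connection B: the same call now waits up to 5 seconds and returns 0, which is
# the case that surfaces in the logs as "failed to acquire lock".
mysql -h "$DB_HOST" -e "SELECT GET_LOCK('wikidatawiki-activeusers', 5) AS acquired,
                               IS_USED_LOCK('wikidatawiki-activeusers') AS held_by_thread;"

wait  # connection A ends; named locks are released automatically on disconnect
```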
[15:11:33] I was checking
[15:11:40] but now
[15:11:42] no
[15:12:07] it started around 7:30 utc
[15:12:15] and I do not see anything on SAL around that
[15:12:34] it started at 8:05
[15:12:52] 2 events happened then
[15:13:33] I do not believe db1073 has anything to do with it because it is s1
[15:13:51] unless the problem was latent
[15:13:56] I did a deploy but not related to s5
[15:14:00] ah yes, that one
[15:14:05] it was at 8:05
[15:14:24] the other thing is moritzm's log
[15:14:40] who I invoke here to see if he touched eqiad
[15:15:59] it could be just a coincidence, but better to ask about what we know than start by guessing
[15:16:38] I am double checking my change
[15:16:47] and I don't see anything wrong with it
[15:18:17] jynus: I mostly deployed changes unrelated to the db servers, mostly firejail/nss/java on non-DB hosts. the only thing that also touched db* today is tcpdump, which should be fine
[15:18:26] no
[15:18:28] I mean
[15:18:33] if it is a problem
[15:18:39] it would be on mw*
[15:18:41] not dbs
[15:18:59] dbs have no issues themselves
[15:19:19] ah, ok. no, as far as mw* is concerned the only change today is the switch of the NTP servers in codfw to systemd-timesyncd
[15:19:30] only codfw, are you sure?
[15:19:42] yeah, that was configured via a Hiera knob
[15:19:47] thanks
[15:19:58] (I had to ask, as I said before)
[15:20:01] and I also double-checked with a puppet run on some mw1 host that it was a NOP for eqiad
[15:20:04] sure!
[15:20:22] 1.29.0-wmf.9, is this a recent update?
[15:21:04] • 02:16 l10nupdate@tin: scap sync-l10n completed (1.29.0-wmf.9) (duration: 06m 11s)
[15:21:06] group2 wikis to 1.29.0-wmf.9 on 1-27
[15:21:17] nah, that is only the translated messages
[15:21:53] at this point I would poke wikidata devs
[15:22:44] and deployers
[15:23:22] I am checking the logs of one of the servers that complained but I don't see anything relevant
[15:24:22] mw logs or db logs?
[15:24:26] mw
[15:28:53] https://phabricator.wikimedia.org/T156638
[15:29:51] let's see what they say
[15:37:17] 10DBA: duplicate key problems - https://phabricator.wikimedia.org/T151029#2982095 (10jcrespo) 05Open>03Resolved
[15:37:23] 07Blocked-on-schema-change, 10DBA, 10Wikimedia-Site-requests, 06Wikisource, and 3 others: Schema change for page content language - https://phabricator.wikimedia.org/T69223#2982096 (10jcrespo)
[16:01:16] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982158 (10Papaul) Disk replacement complete on slot 11
[16:05:04] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982163 (10Marostegui) Thanks - It is getting rebuilt ``` root@db2011:/usr/local/bin# megacli -PDRbld -ShowProg -PhysDrv [32:11] -aALL Rebuild Progress on Device at Enclosure 32, Slot 11 Completed 44% in 1...
[16:23:25] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982242 (10Marostegui) Rebuild finished successfully ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline : 0 Physical Devices...
[16:23:41] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982243 (10Marostegui) 05Open>03Resolved a:03Papaul
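(For reference on the RAID checks quoted above, a minimal sketch of the usual MegaCli invocations, assuming a single adapter addressed via -aALL and the binary installed as `megacli`; the enclosure/slot pair [32:11] is the one from db2011 and will differ on other hosts.)

```
# Virtual-drive summary: "Degraded : 0" is the state reported once the rebuild finished.
megacli -LDInfo -Lall -aALL

# Rebuild progress for the replaced disk (enclosure 32, slot 11 on db2011).
megacli -PDRbld -ShowProg -PhysDrv [32:11] -aALL

# Per-disk firmware state: "Rebuild" while rebuilding, "Online, Spun Up" when done.
megacli -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number|Firmware state'
```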
[17:58:31] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982608 (10Papaul) @Robh we are about to move db2034 in row C rack C6 to row A rack 5. If you have time, I would like you to please make some changes on both switches ....
[18:08:38] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982644 (10RobH) >>! In T156478#2982608, @Papaul wrote: > @Robh we about to move db2034 in row c rack C6 to row A rack 5. I will like for you please if you have time to m...
[18:43:43] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982801 (10Papaul) @Marostegui server is now in A5. Just waiting for https://gerrit.wikimedia.org/r/#/c/335054/ to be merged.
[19:15:54] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982951 (10jcrespo) Merged. The virtual console is busy (I assume by you), so I do not have visibility of the state of the server right now.
[21:11:06] 10DBA: Json_extract available on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T156681#2983518 (10Nuria)
[21:11:16] 10DBA, 10Analytics: Json_extract available on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T156681#2983530 (10Nuria)
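(On the JSON_EXTRACT request above: the function only ships natively with newer server releases, roughly MySQL 5.7+ and MariaDB 10.2+, while the hosts discussed in this log run 10.0.x, so making it available implies an upgrade or an add-on. A minimal usage sketch, assuming a server version that has the built-in JSON functions; the table and column names are purely illustrative.)

```
# Capability check: older servers answer with "FUNCTION ... JSON_EXTRACT does not exist".
mysql -h analytics-store.eqiad.wmnet -e "SELECT JSON_EXTRACT('{\"event\": {\"action\": \"save\"}}', '\$.event.action');"

# Typical aggregation over a JSON-bearing column (hypothetical table/column names).
mysql -h analytics-store.eqiad.wmnet -e "
  SELECT JSON_EXTRACT(event_blob, '\$.action') AS action, COUNT(*) AS n
  FROM log.SomeSchema_12345
  GROUP BY action
  LIMIT 10;"
```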