[07:22:45] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968039 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1072.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage...
[07:31:06] 10DBA, 06Operations, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2980985 (10Marostegui) I have upgraded db2012 to 10.0.29-2 (actually done a full upgrade) and the A...
[07:46:28] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2980992 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1072.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['db1072.eqiad.wmnet']) ```
[07:50:43] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2980995 (10Marostegui) >>! In T156226#2980992, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ``` > ['db1072.eqiad.wmnet'] > ``` > > Of which those **FAILED**: > ``` > set(['db1072....
[07:55:08] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2980998 (10Marostegui) And from the reimage command output: ``` sudo -E wmf-auto-reimage -p T156226 db1072.eqiad.wmnet START To monitor the full log: tail -F /var/log/wmf-auto-reimage/201701300722_m...
[08:03:35] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2968039 (10MoritzMuehlenhoff) I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key and difficult to fix in the current design of w...
[08:13:28] 10DBA, 06Operations, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2981031 (10Marostegui) >>! In T156226#2981023, @MoritzMuehlenhoff wrote: > I've seen this once or twice during the app server reimages as well. IIRC it was related to a race in adding the salt key an...
[14:07:41] let's see how 10.0.29 behaves in db1072 and if it catches up :-) https://grafana-admin.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1072&from=now-24h&to=now
[15:01:13] 10DBA, 06Operations, 10ops-codfw: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2981946 (10Marostegui)
[15:08:31] there is a high number of dbquery errors since 8 today
[15:09:04] most on wikidatawiki
[15:09:16] wow
[15:09:17] indeed
[15:09:56] jobs
[15:10:08] started at 7:35 or so
[15:10:15] recentChangesUpdate
[15:10:37] DatabaseMysqlBase::lock failed to acquire lock 'wikidatawiki-activeusers'
[15:11:27] was there any deploy or something?
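(For context on the lock error above: MediaWiki's DatabaseMysqlBase::lock() is built on MySQL/MariaDB named advisory locks, so "failed to acquire lock" means a GET_LOCK() call timed out while another connection was holding the same name. The sketch below only illustrates that mechanism; the host is a placeholder, credentials are assumed to come from ~/.my.cnf, and the exact lock string MediaWiki passes may be prefixed or hashed rather than the literal name from the log.)

```
#!/bin/bash
# Placeholder host; connection credentials are assumed to live in ~/.my.cnf.
DB_HOST="db-host.example"

# Connection A: acquire the named lock (GET_LOCK returns 1 on success) and keep
# the session, and therefore the lock, alive for 30 seconds in the background.
mysql -h "$DB_HOST" -e "SELECT GET_LOCK('wikidatawiki-activeusers', 5); SELECT SLEEP(30);" &

sleep 2  # give connection A time to take the lock

# Connection B: the same call now waits up to 5 seconds and returns 0, which is
# the case that surfaces in the logs as "failed to acquire lock".
mysql -h "$DB_HOST" -e "SELECT GET_LOCK('wikidatawiki-activeusers', 5) AS acquired,
                               IS_USED_LOCK('wikidatawiki-activeusers') AS held_by_thread;"

wait  # connection A ends; named locks are released automatically on disconnect
```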
[15:11:33] I was checking
[15:11:40] but now
[15:11:42] no
[15:12:07] it started around 7:30 utc
[15:12:15] and I do not see anything on SAL around that
[15:12:34] it started at 8:05
[15:12:52] 2 events happened then
[15:13:33] I do not believe db1073 has anything to do with it because it is s1
[15:13:51] unless the problem was latent
[15:13:56] I did a deploy but not related to s5
[15:14:00] ah yes, that one
[15:14:05] it was at 8:05
[15:14:24] the other thing is moritzm's log
[15:14:40] who I invoke here to see if he touched eqiad
[15:15:59] it could be just a coincidence, but better to ask about what we know than start by guessing
[15:16:38] I am double checking my change
[15:16:47] and I don't see anything wrong with it
[15:18:17] jynus: I mostly deployed changes unrelated to the db servers, mostly firejail/nss/java on non-DB hosts. the only thing that also touched db* today is tcpdump, which should be fine
[15:18:26] no
[15:18:28] I mean
[15:18:33] if it is a problem
[15:18:39] it would be on mw*
[15:18:41] not dbs
[15:18:59] dbs have no issues themselves
[15:19:19] ah, ok. no, as far as mw* is concerned the only change today is the switch of the NTP servers in codfw to systemd-timesyncd
[15:19:30] only codfw, are you sure?
[15:19:42] yeah, that was configured via a Hiera knob
[15:19:47] thanks
[15:19:58] (I had to ask, as I said before)
[15:20:01] and I also double-checked with a puppet run on some mw1 host that it was a NOP for eqiad
[15:20:04] sure!
[15:20:22] 1.29.0-wmf.9, is this a recent update?
[15:21:04] • 02:16 l10nupdate@tin: scap sync-l10n completed (1.29.0-wmf.9) (duration: 06m 11s)
[15:21:06] group2 wikis to 1.29.0-wmf.9 on 1-27
[15:21:17] nah, that is only the translated messages
[15:21:53] at this point I would poke wikidata devs
[15:22:44] and deployers
[15:23:22] I am checking the logs of one of the servers that complained but I don't see anything relevant
[15:24:22] mw logs or db logs?
[15:24:26] mw
[15:28:53] https://phabricator.wikimedia.org/T156638
[15:29:51] let's see what they say
[15:37:17] 10DBA: duplicate key problems - https://phabricator.wikimedia.org/T151029#2982095 (10jcrespo) 05Open>03Resolved
[15:37:23] 07Blocked-on-schema-change, 10DBA, 10Wikimedia-Site-requests, 06Wikisource, and 3 others: Schema change for page content language - https://phabricator.wikimedia.org/T69223#2982096 (10jcrespo)
[16:01:16] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982158 (10Papaul) Disk replacement complete on slot 11
[16:05:04] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982163 (10Marostegui) Thanks - It is getting rebuilt ``` root@db2011:/usr/local/bin# megacli -PDRbld -ShowProg -PhysDrv [32:11] -aALL Rebuild Progress on Device at Enclosure 32, Slot 11 Completed 44% in 1...
[16:23:25] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982242 (10Marostegui) Rebuild finished successfully ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline : 0 Physical Devices...
[16:23:41] 10DBA, 06Operations, 10ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2982243 (10Marostegui) 05Open>03Resolved a:03Papaul
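(For reference on the RAID checks quoted above, a minimal sketch of the usual MegaCli invocations, assuming a single adapter addressed via -aALL and the binary installed as `megacli`; the enclosure/slot pair [32:11] is the one from db2011 and will differ on other hosts.)

```
# Virtual-drive summary: "Degraded : 0" is the state reported once the rebuild finished.
megacli -LDInfo -Lall -aALL

# Rebuild progress for the replaced disk (enclosure 32, slot 11 on db2011).
megacli -PDRbld -ShowProg -PhysDrv [32:11] -aALL

# Per-disk firmware state: "Rebuild" while rebuilding, "Online, Spun Up" when done.
megacli -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number|Firmware state'
```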
[17:58:31] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982608 (10Papaul) @Robh we are about to move db2034 in row C rack C6 to row A rack 5. If you have time, I would like you to please make some changes on both switches ....
[18:08:38] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982644 (10RobH) >>! In T156478#2982608, @Papaul wrote: > @Robh we about to move db2034 in row c rack C6 to row A rack 5. I will like for you please if you have time to m...
[18:43:43] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982801 (10Papaul) @Marostegui server is now in A5. Just waiting for https://gerrit.wikimedia.org/r/#/c/335054/ to be merged.
[19:15:54] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2982951 (10jcrespo) Merged. The virtual console is busy (I assume by you), so I do not have visibility of the state of the server right now.
[21:11:06] 10DBA: Json_extract available on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T156681#2983518 (10Nuria)
[21:11:16] 10DBA, 10Analytics: Json_extract available on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T156681#2983530 (10Nuria)
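(On the JSON_EXTRACT request above: the function only ships natively with newer server releases, roughly MySQL 5.7+ and MariaDB 10.2+, while the hosts discussed in this log run 10.0.x, so making it available implies an upgrade or an add-on. A minimal usage sketch, assuming a server version that has the built-in JSON functions; the table and column names are purely illustrative.)

```
# Capability check: older servers answer with "FUNCTION ... JSON_EXTRACT does not exist".
mysql -h analytics-store.eqiad.wmnet -e "SELECT JSON_EXTRACT('{\"event\": {\"action\": \"save\"}}', '\$.event.action');"

# Typical aggregation over a JSON-bearing column (hypothetical table/column names).
mysql -h analytics-store.eqiad.wmnet -e "
  SELECT JSON_EXTRACT(event_blob, '\$.action') AS action, COUNT(*) AS n
  FROM log.SomeSchema_12345
  GROUP BY action
  LIMIT 10;"
```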