[03:36:08] <wikibugs>	 10DBA, 10MediaWiki-Page-deletion, 10Operations, 10Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4125816 (10Anomie)
[03:36:11] <wikibugs>	 10DBA, 10MediaWiki-Page-deletion, 10Operations, 10MW-1.31-release-notes (WMF-deploy-2018-04-17 (1.31.0-wmf.30)), and 2 others: Reduce locking contention on deletion of pages - https://phabricator.wikimedia.org/T191892#4125813 (10Anomie) 05Open>03Resolved a:03Anomie This should be resolved now, for de...
[05:36:36] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125924 (10Marostegui) I have tried this on dewiki directly on the master with no issues. I also used: `SET SESSION innodb_lock_wait_timeout=1; SET SESSION lock_wait_timeout=15;`  Will do the same with s...
[05:37:08] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125925 (10Marostegui)
[05:39:47] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125927 (10Marostegui)
[05:46:22] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125941 (10Marostegui) I also did it on commons, which has a lot more load than s5 or s6, and there were no issues there as well.
[05:46:42] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125942 (10Marostegui)
[05:50:00] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125944 (10Marostegui)
[06:09:06] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125980 (10Marostegui)
[06:22:36] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125983 (10Marostegui)
[06:25:11] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125985 (10Marostegui) enwiki has been altered directly on the master without any issues.  For s3, I will alter codfw on the master and eqiad I will do slave by slave (there are just 3 of them + dbstore1...
[06:25:17] <wikibugs>	 10Blocked-on-schema-change, 10DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125986 (10Marostegui)
[07:20:00] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4126042 (10Marostegui)
[07:20:35] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4083536 (10Marostegui) 05Open>03Resolved a:03Marostegui This is all done! One less drift between HEAD and production! Thanks! :)
[07:26:32] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4126056 (10Marostegui)
[07:35:05] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4126065 (10jcrespo) No issue or locking or strangeness of any kind on any server?
[07:35:52] <wikibugs>	 10DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126067 (10Marostegui)
[07:36:04] <wikibugs>	 10DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126080 (10Marostegui)
[07:36:07] <wikibugs>	 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3353952 (10Marostegui)
[07:36:22] <wikibugs>	 10DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126067 (10Marostegui) p:05Triage>03Normal
[07:37:15] <wikibugs>	 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3362343 (10Marostegui)
[07:40:51] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4126089 (10Marostegui) None of those :)
[07:47:45] <wikibugs>	 10DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126115 (10Marostegui) The table is empty everywhere, so there is no need to take a backup. For the record this is the structure:  ``` CREATE TABLE `linkscc` (   `lcc_pageid` int(10) unsigned NOT NULL DEFAULT '0',   `lcc_cacheobj` mediumblob N...
[07:50:00] <jynus>	 db2033 reimaged already? Do I clean it up from netboot?
[07:50:09] <marostegui>	 Yeah, it is all done
[07:51:11] <wikibugs>	 10DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126121 (10Marostegui)
[08:03:37] <jynus>	 I may need help, install of es1012 failed
[08:03:54] <jynus>	 no network detected
[08:04:25] <jynus>	 but maybe I broke the netboot.cfg?
[08:04:38] <marostegui>	 it didn't boot via PXE?
[08:04:43] <jynus>	 it did
[08:04:47] <marostegui>	 I mean, did it reach that point?
[08:04:53] <jynus>	 but network failed to be configured automatically
[08:05:06] <jynus>	 and the "early script" failed, too
[08:05:40] <jynus>	 I rebooted back to normal
[08:05:42] <marostegui>	 what happens if you configure it manually? it keeps failing?
[08:05:57] <jynus>	 but if you can help me check everything I changed?
[08:06:00] <marostegui>	 sure
[08:06:16] <marostegui>	 I am checking the DHCP logs
[08:06:34] <jynus>	 https://gerrit.wikimedia.org/r/#/c/425492/
[08:07:04] <jynus>	 and https://gerrit.wikimedia.org/r/#/c/425765/
[08:08:18] <marostegui>	 so the DHCP detects the server fine
[08:08:53] <jynus>	 and actually, if it fails, it should fail on partitioning, not before
[08:09:10] <marostegui>	 yeah
[08:09:23] <jynus>	 should I just try again?
[08:09:30] <marostegui>	 try again and I will tail the log
[08:09:33] <marostegui>	 (the dhcp)
[08:09:39] <jynus>	 cool, thanks
[08:09:47] <jynus>	 this went without issue on es2XXX
[08:10:11] <marostegui>	 give me one sec
[08:10:16] <marostegui>	 I want to grab all the macs from that server
[08:10:29] <marostegui>	 done
[08:10:41] <jynus>	 ok, rebooting
[08:11:10] <wikibugs>	 10DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126178 (10Marostegui)
[08:11:37] <wikibugs>	 10DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126067 (10Marostegui)
[08:13:52] <jynus>	 maybe the network interfaces confuse the installer? but they should be identical machines to the ones on codfw
[08:14:13] <jynus>	 except maybe not the latest bios/firmware version?
[08:14:23] <marostegui>	 I see now DHCP
[08:14:27] <marostegui>	 giving IP to it
[08:15:08] <jynus>	 nope
[08:15:15] <jynus>	  [!!] Download debconf preconfiguration file
[08:15:15] <marostegui>	 failed?
[08:15:24] <marostegui>	 can you configure the network manually to see what happens?
[08:15:35] <jynus>	 The netmask is used to determine which machines are local to your network.
[08:15:35] <marostegui>	 inet addr:10.64.0.7  Bcast:10.64.3.255  Mask:255.255.252.0
[08:15:44] <jynus>	 I guess?
[08:15:52] <jynus>	 but this shouldn't happen
[08:15:58] <marostegui>	 yeah, but to see if it fails too
[08:16:19] <marostegui>	 gw 10.64.0.1
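The address, netmask and broadcast values pasted above for the manual configuration can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming a POSIX-style shell; the function name `broadcast_of` is hypothetical and not part of any installer tooling. It sets every host bit of the address (the bits cleared in the mask) to 1.

```shell
# Hypothetical helper: compute the IPv4 broadcast address from an address
# and a dotted-quad netmask by OR-ing each octet with the inverted mask octet.
broadcast_of() {
    ip=$1
    mask=$2
    old_ifs=$IFS
    IFS=.
    # Word-split both dotted quads on "." into eight positional parameters.
    set -- $ip $mask
    IFS=$old_ifs
    # $1..$4 are the address octets, $5..$8 are the netmask octets.
    echo "$(( $1 | (255 - $5) )).$(( $2 | (255 - $6) )).$(( $3 | (255 - $7) )).$(( $4 | (255 - $8) ))"
}

broadcast_of 10.64.0.7 255.255.252.0   # prints 10.64.3.255
```

The result agrees with the `Bcast:10.64.3.255` value quoted in the channel, so the manually entered network settings are at least internally consistent.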
[08:17:06] <jynus>	 Execution of preseeded command "wget -O /tmp/early_command http://apt.wikimedia.org/autoinstall/scripts/early_command.sh && sh  /tmp/early_command" failed with exit code 10.
[08:18:48] <jynus>	 dns address 208.80.154.238?
[08:18:53] <marostegui>	 let me check
[08:19:11] <marostegui>	  208.80.154.239 208.80.153.254
[08:20:01] <jynus>	 I think that worked (?)
[08:20:13] <marostegui>	 did it? :|
[08:20:28] <marostegui>	 maybe something wrong with DHCP?
[08:20:43] <jynus>	 at least it is not (yet) failing
[08:20:46] <marostegui>	 But I see it giving the correct IP: DHCPACK on 10.64.0.7 to 44:a8:42:35:67:f5 via 10.64.0.3
[08:21:02] <jynus>	 yeah, the installer did such request
[08:21:17] <jynus>	 and it looks like it succeeded
[08:21:47] <jynus>	 I am going to bet some kind of network change? I know some people were restricting network usage
[08:21:52] <marostegui>	 if it fails again, maybe go to the console and check ifconfig and ping the gateway or try to download the file
[08:21:58] <marostegui>	 (I downloaded the file from db2033 and it worked)
[08:22:22] <jynus>	 but if it was network, dhcp would fail
[08:22:34] <jynus>	 maybe http was restricted, and only wget fails?
[08:22:43] <marostegui>	 yeah, but doing it manually worked
[08:22:46] <jynus>	 so it appears like a network failure but it is a protocol failure?
[08:22:52] <marostegui>	 I checked SAL and see nothing relevant btw
[08:22:52] <jynus>	 mmmh
[08:23:59] <jynus>	 it has to be an eqiad-only issue, it didn't happen on codfw
[08:24:07] <jynus>	 so maybe an install1002 issue?
[08:24:21] <jynus>	 loading components is very very slow
[08:24:38] <jynus>	 (installer components, which should take nothing)
[08:24:49] <marostegui>	 I was checking arzhel logs in phab to see if there is anything changed, and I cannot find any comments from him
[08:28:41] <wikibugs>	 10DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126196 (10Marostegui)
[08:28:54] <wikibugs>	 10DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126067 (10Marostegui) 05Open>03Resolved Table removed everywhere
[08:28:57] <wikibugs>	 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4126199 (10Marostegui)
[08:29:35] <wikibugs>	 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3377904 (10Marostegui)
[08:41:03] <wikibugs>	 10DBA, 10Operations, 10ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4126226 (10Marostegui) a:03Papaul
[08:41:06] <wikibugs>	 10DBA, 10Operations, 10ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123236 (10Marostegui) p:05Triage>03Normal
[08:43:41] <wikibugs>	 10DBA, 10MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), 10Patch-For-Review, 10User-notice, and 2 others: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4126232 (10Maroste...
[09:03:42] <wikibugs>	 10DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126282 (10Marostegui) This host has dropped around 300 packets in 15h or so. Yesterday I checked the amount of drops in its interface and it was 1815, today it is 2103.  This is the amount of drops over eqiad s1 hosts, this host has the...
[09:06:09] <wikibugs>	 10DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126288 (10jcrespo) The errors would be consistent with the 10-interval in which the connections happen (bursts of high activity). But not as large as thinking it is a hardware error.
[09:07:49] <jynus>	 marostegui: which graph are you checking? there are 2 sources, prometheus and graphite, and it may be on one of those, if missing from the other
[09:07:58] <marostegui>	 I am checking prometheus
[09:08:23] <marostegui>	 btw, I am not saying it is a HW error, I was throwing more stuff into the task for the sake of having more info with what I saw :)
[09:09:13] <jynus>	 https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1114&var-network=eno1
[09:09:18] <jynus>	 ^maybe that helps?
[09:09:26] <jynus>	 oh, I was agreeing with you
[09:09:46] <jynus>	 not disagreeing
[09:09:55] <marostegui>	 Oh, that helps! I forgot to change it hehe :)
[09:09:56] <marostegui>	 Thanks!
[09:10:05] <jynus>	 so that is graphite
[09:10:15] <jynus>	 were you checking that one?
[09:10:23] <moritzm>	 have the reimage issues been sorted out? I'm about to reimage mw1279 to stretch, maybe that helps narrowing it down
[09:10:35] <jynus>	 moritzm: please do
[09:10:41] <marostegui>	 jynus: yeah, but I am super silly and forgot to change the network iface :)
[09:10:52] <jynus>	 it went through but network was very slow
[09:10:59] <jynus>	 apt update taking 20 minutes
[09:11:04] <jynus>	 something is wrong
[09:11:26] <jynus>	 if it goes fast on your install, then it is the host or its local network only
[09:12:34] <moritzm>	 ack, I'll report back
[09:13:44] <jynus>	 puppet agent -tv took a minute or so, something is weird still
[09:15:55] <jynus>	 (with no actual execution, I mean, first run should take minutes as usual)
[09:22:52] <jynus>	 E: The repository 'http://apt.wikimedia.org/wikimedia stretch-wikimedia InRelease' is not signed.
[09:23:11] <jynus>	 so maybe that install failed?
[09:31:08] <moritzm>	 mhh, an "apt-get update" on a stretch host works, though (and that would also fail if the sig were really missing)
[09:31:26] <jynus>	 I added it manually
[09:31:40] <volans>	 does d-i ask you for the GW?
[09:31:47] <jynus>	 yes
[09:31:54] <volans>	 last time it happened was for a syntax error in netboot.cfg
[09:32:04] <volans>	 I was looking at Ic5804be06ffe319860a233ec83eed0a8f83449f2, is that valid syntax?
[09:32:07] <jynus>	 ok, so my initial guess was correct
[09:32:10] <volans>	 |es101(1|[4-6]|8|9)|es201[1-9])
[09:32:18] <jynus>	 I will fix that
[09:32:21] <volans>	 I don't think parentheses are allowed in a case/esac
[09:32:25] <volans>	 not fully sure though
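The point about parentheses can be illustrated with a small sketch. In a POSIX `case` statement the patterns are shell globs separated by `|`, and an unquoted `(` inside a pattern is not a grouping operator, so the alternatives in `es101(1|[4-6]|8|9)` have to be written out one by one. The function below is hypothetical and is not the actual netboot.cfg content; it only shows one way the quoted host group could be expressed with plain glob alternatives.

```shell
# Hypothetical helper, not taken from netboot.cfg: reports whether a hostname
# belongs to the group that the quoted pattern was trying to express.
matches_es_group() {
    case "$1" in
        # Each alternative is a plain glob; bracket expressions are allowed,
        # parenthesised groups are not.
        es1011|es101[4-6]|es1018|es1019|es201[1-9]) echo yes ;;
        *) echo no ;;
    esac
}

matches_es_group es1015   # prints yes
matches_es_group es1017   # prints no
```

The four `es101x` alternatives could also be folded into a single bracket expression such as `es101[145689]`, which matches the same hostnames.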
[09:32:45] <jynus>	 but how come it only half-fails?
[09:32:48] <jynus>	 :-)
[09:33:05] <volans>	 it fails for all the hosts after the error
[09:33:30] <jynus>	 no, I mean the install doesn't fail, it is just the network is slow
[09:33:58] <volans>	 it fails to get the preseed file, so all the configurations that we do there are gone
[09:34:19] <volans>	 maybe it is using the public debian mirrors and not our internal one? dunno
[09:34:22] <volans>	 :)
[09:34:48] <jynus>	 I will fix and reimage again
[09:35:36] <jynus>	 that host shouldn't be able to access the network except through a proxy!
[09:37:38] <volans>	 yeah I don't know if maybe during d-i there are special rules tbh
[09:44:55] <wikibugs>	 10DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126350 (10Marostegui) I have investigated some concrete timing of the connection issues correlated with logstash errors.  So far they all seem to match with the following type of query: ``` ApiQueryRevisions::run SELECT  rev_id,rev_page,...
[09:46:23] <wikibugs>	 10DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126355 (10jcrespo) MMm, so api queries timing out get killed? That could be. But aren't those connection errors? Needs more research.
[09:49:08] <wikibugs>	 10DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126367 (10Marostegui) Yeah, not completely sure, as we get lots of these on logstash:  ```  Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1114 is not replicating? ```
[09:55:11] <wikibugs>	 10DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126376 (10jcrespo) That would explain the disconnections- too many connections leads to heartbeat check fails, which leads to disconnections.
[11:06:05] <jynus>	 volans: thanks, it was that
[11:11:50] <volans>	 yw :)
[11:31:24] <wikibugs>	 10DBA, 10MediaWiki-Watchlist, 10Wikidata, 10Performance, and 4 others: re-enable Wikidata Recent Changes integration on Russian Wikipedia - https://phabricator.wikimedia.org/T179012#4126553 (10TTO) 05Open>03Resolved Seems done. I assume the Russian community was notified? If not, they probably would ha...
[11:31:38] <wikibugs>	 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#4126555 (10TTO)
[11:31:44] <wikibugs>	 10DBA, 10Commons, 10MediaWiki-Watchlist, 10Wikidata, and 4 others: Re-enable Wikidata Recent Changes integration on Commons - https://phabricator.wikimedia.org/T179010#3710082 (10TTO) T179012 is complete. Perhaps it is time to revisit this?
[13:20:52] <jynus>	 are the connection errors gone?
[13:21:57] <jynus>	 it looks like they are, since 12:20
[13:22:14] <jynus>	 12:10, actually
[13:23:45] <jynus>	 I am going to guess the 12:18 "Depool db1114 from API" had something to do with it
[13:27:43] <marostegui>	 Yeah
[13:27:48] <marostegui>	 I am going to do another test
[13:27:52] <marostegui>	 Pool it back as it used to be
[13:27:55] <marostegui>	 See if they come back
[13:27:58] <marostegui>	 And then depool it from main
[13:28:21] <marostegui>	 I wanted to give it 1 hour, so in 30 minutes I will pool it back
[13:28:23] <jynus>	 there could be some missing indexes or something
[13:28:35] <jynus>	 or some strange config
[13:28:42] <marostegui>	 I can check the schemas too yeah
[13:35:18] <jynus>	 I don't know, I am just brainstorming
[13:43:15] <wikibugs>	 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126856 (10Marostegui) db1080 and db1114 have the same schema structure for revision table. However db1066 (the other API slave) has the incorrect indexes for the revision table: What it should have:  ```   KEY `use...
[13:43:17] <marostegui>	 jynus: ^
[13:43:59] <wikibugs>	 10DBA, 10Epic, 10Patch-For-Review, 10codfw-rollout: Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover) - https://phabricator.wikimedia.org/T155099#4126865 (10Marostegui)
[13:44:03] <wikibugs>	 10DBA, 10MediaWiki-API: Database query error (internal_api_error_DBQueryError) while getting list=allrevisions - https://phabricator.wikimedia.org/T123557#4126866 (10Marostegui)
[13:44:15] <wikibugs>	 10DBA, 10Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#4126861 (10Marostegui) 05Resolved>03Open T191996#4126856 db1066 needs fixing for the indexes
[13:47:19] <jynus>	 could that be not a fix, but a breakage?
[13:47:43] <jynus>	 probably not
[13:47:50] <marostegui>	 we will see with the next test, I am updating the task
[13:47:58] <marostegui>	 We should see the same errors on db1080 I would say
[13:48:32] <jynus>	 yah
[13:48:38] <wikibugs>	 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126886 (10Marostegui) So after depooling db1114 from API errors are gone, see: {F16934299}  As a next test I am going to repool db1114 as API, to make sure errors are back. Once they are, I will then depool it from...
[13:53:37] <jynus>	 aside from schema, I would do a diff of grants
[13:53:40] <jynus>	 just in case
[14:02:12] <wikibugs>	 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4126969 (10Papaul) p:05Triage>03Normal
[14:11:33] <wikibugs>	 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127003 (10Anomie) The different indexes shouldn't make a difference (although I can't promise they don't). InnoDB's clustering should be appending the primary key on db1114 so the "real" indexes are what's explicit...
[14:15:32] <wikibugs>	 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127006 (10Marostegui) >>! In T191996#4126886, @Marostegui wrote: > So after depooling db1114 from API errors are gone, see: > {F16934299} >  > That graph is wrong it wasn't auto-refreshing. Errors kept going while...
[14:17:29] <wikibugs>	 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127022 (10Marostegui) db1080 uses the same (it has the same indexes as db1114). Reminder: db1066 and db1080 uses 10.0 and db1114 uses 10.  ```  +------+-------------+---------------------+--------+-----------------...
[14:19:08] <wikibugs>	 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127025 (10Anomie) >>! In T191996#4127003, @Anomie wrote: > As in T186266, the same problem exists if we simplify the query to the single table `SELECT * FROM revision WHERE rev_page = '19131234'  ORDER BY rev_times...
[14:21:42] <wikibugs>	 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127027 (10Marostegui) >>! In T191996#4127006, @Marostegui wrote: >>>! In T191996#4126886, @Marostegui wrote: >> So after depooling db1114 from API errors are gone, see: >> {F16934299} >>  >> > That graph is wrong i...
[14:31:46] <wikibugs>	 10DBA, 10Operations, 10ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127041 (10Papaul) @Marostegui is it okay for me to reboot the server?
[14:31:54] <marostegui>	 jynus: ^
[14:32:01] <marostegui>	 (as you know better its state)
[14:32:40] <wikibugs>	 10DBA, 10Operations, 10ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127043 (10Marostegui) @Papaul let me double check with @jcrespo as he is/was working with esXXXX servers.
[14:34:36] <wikibugs>	 10DBA, 10Operations, 10ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127055 (10jcrespo) Not now, I will have to depool it. Give me 5 minutes.
[14:35:01] <marostegui>	 I can take care of that, I just wanted to ask you before, as you know better what you were doing or planning :)
[14:35:14] <marostegui>	 Let me know if you want me to take over :)
[14:39:45] <jynus>	 I want first to reproduce the issue
[14:46:34] <wikibugs>	 10DBA, 10Operations: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127091 (10jcrespo) p:05Normal>03Low a:05Papaul>03jcrespo @Papaul @Marostegui Please don't do anything until it is clear what is the issue.
[14:52:27] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#4127133 (10Bstorm)
[14:55:15] <wikibugs>	 10DBA, 10Operations: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127152 (10jcrespo) Now that I have a way to test it, we can proceed, depooling:   ``` $ ipmitool -I lanplus -H es2013.mgmt.codfw.wmnet -U root -E chassis power status  Unable to read password from environment...
[15:06:14] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127179 (10jcrespo) a:05jcrespo>03Papaul @Papaul you are now free to handle the server- it is up, but with all the service down and depooled. I would try the reset I propos...
[15:06:29] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127182 (10jcrespo) p:05Low>03Normal
[15:07:21] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123236 (10jcrespo) The reset a previous ticket suggested was T191977#4123270 (`racadm reset`)
[15:18:50] <wikibugs>	 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Apply schema changes to an isolated database and examine the results - https://phabricator.wikimedia.org/T191391#4127203 (10jcrespo) I found T86530, which may be outdated, but may help with givin...
[15:20:18] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127209 (10jcrespo)
[15:29:39] <wikibugs>	 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127233 (10Marostegui) >>! In T191996#4126367, @Marostegui wrote: > Yeah, not completely sure, as we also get lots of these on logstash:  > ``` >  Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1114 is no...
[15:55:15] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127273 (10Papaul) 1- Power drain 2- Reset IDRAC 3- Update BIOS from 2.1.7 to 2.7.1 4- Update IDRAC from 2.21 to 2.52
[15:56:02] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127277 (10Papaul) a:05Papaul>03jcrespo
[15:56:29] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127278 (10Marostegui) It is still not working: ``` root@neodymium:/home/marostegui#  ipmitool -I lanplus -H es2013.mgmt.codfw.wmnet -U root -E chassis power sta...
[16:08:59] <jynus>	 T191977#4127278 :-(
[16:08:59] <stashbot>	 T191977: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977
[16:09:11] <wikibugs>	 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127339 (10Marostegui) So, errors are back when the server is only on API. I am going to repool it normally for now. Tomorrow I will leave it pooled in API and will try to do a traffic capture during the burst of er...
[16:09:35] <marostegui>	 jynus: yeah, papaul asked me to reboot it one more time, so it is rebooting now
[16:09:51] <volans>	 jynus: does local ipmi work?
[16:09:58] <volans>	 if it does, maybe it is just the remote that is disabled
[16:10:14] <jynus>	 what is local, the real machine?
[16:10:31] <volans>	 yes
[16:10:32] <jynus>	 or the mgmt server?
[16:10:41] <jynus>	 we will know on reboot :-)
[16:10:43] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127346 (10Marostegui) After the reboot  @papaul suggested, it still doesn't work :-(
[16:10:46] <volans>	 see the section Remote Lan Channel in https://phabricator.wikimedia.org/T150160#2951190
[16:11:07] <volans>	 if the diff shows some diff you can fix it changing --diff with --commit
[16:11:14] <volans>	 and redo the --diff to ensure there is no diff
[16:11:34] <jynus>	 ok, thanks
[16:11:45] <jynus>	 the interesting stuff is that this seems a new thing
[16:12:01] <jynus>	 at least it is not on your list at the time, and I guess you tested all?
[16:12:28] <jynus>	 volans: if that works, I will ask to double your pay
[16:13:08] <volans>	 at that time yes we fixed all of them in the end, and we have also some icinga checks
[16:13:16] <marostegui>	 oh that is a nice thing :)
[16:13:23] <jynus>	 this wasn't caught on monitoring
[16:13:26] <volans>	 but they cannot check everything because management cards don't like to be pinged too much
[16:13:30] <jynus>	 the mgmt system works
[16:13:34] <volans>	 at least this is the wisdom of the elders
[16:13:48] <jynus>	 it is only the remote thing, will check locally now
[16:13:52] <marostegui>	 it seems to be working locally
[16:14:36] <jynus>	 how did you test it, ipmitool is not installed!
[16:14:50] <volans>	 is it a new reimaged host?
[16:14:57] <marostegui>	 ipmi-chassis
[16:14:58] <jynus>	 oh, there is other ipmi-related tools
[16:15:01] <jynus>	 yeah
[16:15:19] <volans>	 but the hardware is new or is an old one?
[16:15:24] <jynus>	 it is old
[16:15:31] <jynus>	 I just reimaged it recently
[16:15:40] <volans>	 so it should have been working at some point in the past
[16:15:41] <jynus>	 old but probably still in warranty
[16:15:49] <jynus>	 volans: that is what I said
[16:15:56] <jynus>	 it seemed to go wrong
[16:15:59] <volans>	 I'm wondering if re-imaging somehow makes this setting be reset
[16:16:08] <jynus>	 volans: no
[16:16:13] <jynus>	 I had to reimage it manually
[16:16:20] <jynus>	 because I could not make the script reboot it
[16:16:24] <jynus>	 so it wasn't that
[16:16:41] <jynus>	 but some of these es2 hosts changed the motherboard
[16:16:50] <jynus>	 maybe that reset the bios or something?
[16:17:15] <volans>	 maybe, dunno. We had the same thing on eqsin hosts but those are new
[16:17:27] <volans>	 so I guess it was missed in the first configuration
[16:17:38] <jynus>	 marostegui: do I run the command?
[16:17:41] <marostegui>	 sure
[16:17:42] <marostegui>	 go ahead
[16:18:11] <marostegui>	 I asked papaul to join the channel as he can probably provide more info
[16:19:04] <jynus>	 I ran: "ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff"
[16:19:15] <jynus>	 but it said nothing and asked for no password
[16:19:31] <jynus>	 and remote call still doesn't work
[16:19:37] <volans>	 if it shows no diff, it means that the remote calls are enabled already
[16:19:39] <volans>	 so it is not that
[16:19:47] <jynus>	 :-(
[16:19:50] <volans>	 we have another long list of things to try though
[16:19:55] <jynus>	 lol
[16:20:02] <jynus>	 I may try them tomorrow
[16:20:04] <volans>	 one is that the root password might have been misaligned with the host's one
[16:20:15] <jynus>	 it is on that ticket, right?
[16:20:37] <volans>	 on T150160 most of them yeah, but really we should find the time to document that wisdom
[16:20:38] <stashbot>	 T150160: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160
[16:20:57] <volans>	 e.ma also opened T191956 a few days ago ;)
[16:20:57] <stashbot>	 T191956: Document how to fix IPMI issues on Wikitech  - https://phabricator.wikimedia.org/T191956
[16:21:10] <jynus>	 I will bookmark T150160 and try tomorrow
[16:21:22] <jynus>	 today we already tried the reset and power drain
[16:22:02] <jynus>	 this host will likely never get reimaged again
[16:22:23] <jynus>	 and the management interface is still accessible, so not really urgent
[16:23:17] <jynus>	 and I guess for those, I could try them on my own
[16:25:19] <volans>	 ack, the description of the task has most of the common issue/fixes
[16:25:37] <volans>	 there are some more hidden in related tasks, but I hope you can get away with one of the common ones
[16:39:09] <volans>	 jynus: fixed
[16:39:30] <volans>	 it was the password, I had tried myself the remote IPMI and noticed it failed right away, no timeout
[16:39:35] <volans>	 that usually indicates it's the password ;)
[16:46:37] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127531 (10Volans) 05Open>03Resolved I've fixed it, it was a case of password misalignment, see one of the cases described in T150160,
[20:33:39] <wikibugs>	 10DBA, 10MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), 10Patch-For-Review, 10User-notice, and 2 others: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4128206 (10mmodell...