[03:36:08] DBA, MediaWiki-Page-deletion, Operations, Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4125816 (Anomie)
[03:36:11] DBA, MediaWiki-Page-deletion, Operations, MW-1.31-release-notes (WMF-deploy-2018-04-17 (1.31.0-wmf.30)), and 2 others: Reduce locking contention on deletion of pages - https://phabricator.wikimedia.org/T191892#4125813 (Anomie) Open>Resolved a:Anomie This should be resolved now, for de...
[05:36:36] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125924 (Marostegui) I have tried this on dewiki directly on the master with no issues. I also used: `SET SESSION innodb_lock_wait_timeout=1; SET SESSION lock_wait_timeout=15;` Will do the same with s...
[05:37:08] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125925 (Marostegui)
[05:39:47] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125927 (Marostegui)
[05:46:22] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125941 (Marostegui) I also did it on commons, which has a lot more load than s5 or s6, and there were no issues there as well.
[05:46:42] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125942 (Marostegui)
[05:50:00] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125944 (Marostegui)
[06:09:06] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125980 (Marostegui)
[06:22:36] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125983 (Marostegui)
[06:25:11] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125985 (Marostegui) enwiki has been altered directly on the master without any issues. For s3, I will alter codfw on the master and eqiad I will do slave by slave (there are just 3 of them + dbstore1...
[06:25:17] Blocked-on-schema-change, DBA: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4125986 (Marostegui)
[07:20:00] Blocked-on-schema-change, DBA, Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4126042 (Marostegui)
[07:20:35] Blocked-on-schema-change, DBA, Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4083536 (Marostegui) Open>Resolved a:Marostegui This is all done! One less drift between HEAD and production! Thanks! :)
[07:26:32] Blocked-on-schema-change, DBA, Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4126056 (Marostegui)
[07:35:05] Blocked-on-schema-change, DBA, Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4126065 (jcrespo) No issue or locking or strangeness of any kind on any server?
[07:35:52] DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126067 (Marostegui)
[07:36:04] DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126080 (Marostegui)
[07:36:07] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3353952 (Marostegui)
[07:36:22] DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126067 (Marostegui) p:Triage>Normal
[07:37:15] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3362343 (Marostegui)
[07:40:51] Blocked-on-schema-change, DBA, Patch-For-Review: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780#4126089 (Marostegui) None of those :)
[07:47:45] DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126115 (Marostegui) The table is empty everywhere, so there is no need to take a backup. For the record this is the structure: ``` CREATE TABLE `linkscc` ( `lcc_pageid` int(10) unsigned NOT NULL DEFAULT '0', `lcc_cacheobj` mediumblob N...
[07:50:00] db2033 reimaged already? Do I clean it up from netboot?
[07:50:09] Yeah, it is all done
[07:51:11] DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126121 (Marostegui)
[08:03:37] I may need help, install of es1012 failed
[08:03:54] no network detected
[08:04:25] but maybe I broke the netboot.cfg?
[08:04:38] it didn't boot via PXE?
[08:04:43] it did
[08:04:47] I mean, did it reach that point?
[08:04:53] but network failed to be configured automatically
[08:05:06] and the "early script" failed, too
[08:05:40] I rebooted back to normal
[08:05:42] what happens if you configure it manually? it keeps failing?
[08:05:57] but if you can help me check everything I changed?
[08:06:00] sure
[08:06:16] I am checking the DHCP logs
[08:06:34] https://gerrit.wikimedia.org/r/#/c/425492/
[08:07:04] and https://gerrit.wikimedia.org/r/#/c/425765/
[08:08:18] so the DHCP detects the server fine
[08:08:53] and actually, if it fails, it should fail on partitioning, not before
[08:09:10] yeah
[08:09:23] should I just try again?
[08:09:30] try again and I will tail the log
[08:09:33] (the dhcp)
[08:09:39] cool, thanks
[08:09:47] this went without issue on es2XXX
[08:10:11] give me one sec
[08:10:16] I want to grab all the macs from that server
[08:10:29] done
[08:10:41] ok, rebooting
[08:11:10] DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126178 (Marostegui)
[08:11:37] DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126067 (Marostegui)
[08:13:52] maybe the network interfaces make the installer confused? but they should be identical machines to the ones on codfw
[08:14:13] except maybe not the latest bios/firmware version?
[08:14:23] I see now DHCP
[08:14:27] giving IP to it
[08:15:08] nope
[08:15:15] [!!] Download debconf preconfiguration file
[08:15:15] failed?
[08:15:24] can you configure the network manually to see what happens?
[08:15:35] The netmask is used to determine which machines are local to your network.
[08:15:35] inet addr:10.64.0.7 Bcast:10.64.3.255 Mask:255.255.252.0
[08:15:44] I guess?
[08:15:52] but this shouldn't happen
[08:15:58] yeah, but to see if it fails too
[08:16:19] gw 10.64.0.1
[08:17:06] Execution of preseeded command "wget -O /tmp/early_command http://apt.wikimedia.org/autoinstall/scripts/early_command.sh && sh /tmp/early_command" failed with exit code 10.
[08:18:48] dns address 208.80.154.238?
[08:18:53] let me check
[08:19:11] 208.80.154.239 208.80.153.254
[08:20:01] I think that worked (?)
[08:20:13] did it? :|
[08:20:28] maybe something wrong with DHCP?
[08:20:43] at least it is not (yet) failing
[08:20:46] But I see it giving the correct IP: DHCPACK on 10.64.0.7 to 44:a8:42:35:67:f5 via 10.64.0.3
[08:21:02] yeah, the installer did such request
[08:21:17] and it looks like it succeeded
[08:21:47] I am going to bet some kind of network change? I know some people were restricting network usage
[08:21:52] if it fails again, maybe go to the console and check ifconfig and ping the gateway or try to download the file
[08:21:58] (I downloaded the file from db2033 and it worked)
[08:22:22] but if it was network, dhcp would fail
[08:22:34] maybe http was restricted, and only wget fails?
[08:22:43] yeah, but doing it manually worked
[08:22:46] so it appears like a network failure but it is a protocol failure?
[08:22:52] I checked SAL and see nothing relevant btw
[08:22:52] mmmh
[08:23:59] it has to be an eqiad-only issue, it didn't happen on codfw
[08:24:07] so maybe install1002 issue?
[08:24:21] loading components is very very slow
[08:24:38] (installer components, which should take nothing)
[08:24:49] I was checking arzhel logs in phab to see if there is anything changed, and I cannot find any comments from him
[08:28:41] DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126196 (Marostegui)
[08:28:54] DBA: Drop table linkscc - https://phabricator.wikimedia.org/T192056#4126067 (Marostegui) Open>Resolved Table removed everywhere
[08:28:57] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4126199 (Marostegui)
[08:29:35] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3377904 (Marostegui)
[08:41:03] DBA, Operations, ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4126226 (Marostegui) a:Papaul
[08:41:06] DBA, Operations, ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123236 (Marostegui) p:Triage>Normal
[08:43:41] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, and 2 others: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4126232 (Maroste...
[09:03:42] DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126282 (Marostegui) This host has dropped around 300 packets in 15h or so. Yesterday I checked the amount of drops in its interface and it was 1815, today it is 2103. This is the amount of drops over eqiad s1 hosts, this host has the...
[09:06:09] DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126288 (jcrespo) The errors would be consistent with the 10-interval in which the connections happen (bursts of high activity). But not as large as thinking it is a hardware error.
[09:07:49] marostegui: which graph are you checking? there are 2 sources, prometheus and graphite, and it may be on one of those, if missing from the other
[09:07:58] I am checking prometheus
[09:08:23] btw, I am not saying it is a HW error, I was throwing more stuff into the task for the sake of having more info with what I saw :)
[09:09:13] https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1114&var-network=eno1
[09:09:18] ^maybe that helps?
[09:09:26] oh, I was agreeing with you
[09:09:46] not disagreeing
[09:09:55] Oh, that helps! I forgot to change it hehe :)
[09:09:56] Thanks!
[09:10:05] so that is graphite
[09:10:15] were you checking that one?
[09:10:23] have the reimage issues been sorted out? I'm about to reimage mw1279 to stretch, maybe that helps narrowing it down
[09:10:35] moritzm: please do
[09:10:41] jynus: yeah, but I am super silly and forgot to change the network iface :)
[09:10:52] it went through but network was very slow
[09:10:59] apt update taking 20 minutes
[09:11:04] something is wrong
[09:11:26] if it goes fast on your install, then it is the host or its local network only
[09:12:34] ack, I'll report back
[09:13:44] puppet agent -tv took a minute or so, something is weird still
[09:15:55] (with no actual execution, I mean, first run should take minutes as usual)
[09:22:52] E: The repository 'http://apt.wikimedia.org/wikimedia stretch-wikimedia InRelease' is not signed.
[09:23:11] so maybe that install failed?
[09:31:08] mhh, an "apt-get update" on a stretch host works, though (and that would also fail if the sig were really missing)
[09:31:26] I added it manually
[09:31:40] does d-i ask you for the GW?
[09:31:47] yes
[09:31:54] last time it happened was for a syntax error in netboot.cfg
[09:32:04] I was looking at Ic5804be06ffe319860a233ec83eed0a8f83449f2, is that valid syntax?
[09:32:07] ok, so my initial guess was correct
[09:32:10] |es101(1|[4-6]|8|9)|es201[1-9])
[09:32:18] I will fix that
[09:32:21] I don't think parentheses are allowed in a case/esac
[09:32:25] not fully sure though
[09:32:45] but how come it only half-fails?
[09:32:48] :-)
[09:33:05] it fails for all the hosts after the error
[09:33:30] no, I mean the install doesn't fail, it is just the network is slow
[09:33:58] it fails to get the preseed file, so all the configurations that we do there are gone
[09:34:19] maybe it is using the public debian mirrors and not our internal one? dunno
[09:34:22] :)
[09:34:48] I will fix and reimage again
[09:35:36] that host shouldn't be able to access the network except through a proxy!
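Editorial note on the 09:32 exchange: the quoted pattern `es101(1|[4-6]|8|9)` uses parentheses as a grouping operator, which POSIX `case` patterns do not support; alternatives can only be separated by a top-level `|`. The sketch below is a hypothetical stand-in, not the real netboot.cfg: the function name `recipe_for` and the recipe names are invented for illustration, and it assumes netboot.cfg is evaluated by a POSIX-style shell, as the case/esac remark suggests.

```shell
#!/bin/sh
# Hypothetical stand-in for one netboot.cfg branch; recipe names are made up.
recipe_for() {
    case "$1" in
        # Not valid in POSIX sh (parentheses do not group inside a pattern):
        #   es101(1|[4-6]|8|9)|es201[1-9]) ...
        # Same host set with the group expanded into plain alternatives:
        es1011|es101[4-6]|es1018|es1019|es201[1-9]) echo "es-recipe" ;;
        *) echo "default-recipe" ;;
    esac
}

recipe_for es1015   # prints "es-recipe"
recipe_for es1017   # prints "default-recipe"
```

Running a candidate file through `sh -n` before merging is one cheap way to catch this class of syntax error, since `-n` parses the script without executing it.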
[09:37:38] yeah I don't know if maybe during d-i there are special rules tbh
[09:44:55] DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126350 (Marostegui) I have investigated some concrete timing of the connection issues correlated with logstash errors. So far they all seem to match with the following type of query: ``` ApiQueryRevisions::run SELECT rev_id,rev_page,...
[09:46:23] DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126355 (jcrespo) MMm, so api queries timing out get killed? That could be. But aren't those connection errors? Needs more research.
[09:49:08] DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126367 (Marostegui) Yeah, not completely sure, as we get lots of these on logstash: ``` Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1114 is not replicating? ```
[09:55:11] DBA: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126376 (jcrespo) That would explain the disconnections: too many connections leads to heartbeat check fails, which leads to disconnections.
[11:06:05] volans: thanks, it was that
[11:11:50] yw :)
[11:31:24] DBA, MediaWiki-Watchlist, Wikidata, Performance, and 4 others: re-enable Wikidata Recent Changes integration on Russian Wikipedia - https://phabricator.wikimedia.org/T179012#4126553 (TTO) Open>Resolved Seems done. I assume the Russian community was notified? If not, they probably would ha...
[11:31:38] DBA, Commons, Contributors-Team, MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#4126555 (TTO)
[11:31:44] DBA, Commons, MediaWiki-Watchlist, Wikidata, and 4 others: Re-enable Wikidata Recent Changes integration on Commons - https://phabricator.wikimedia.org/T179010#3710082 (TTO) T179012 is complete. Perhaps it is time to revisit this?
[13:20:52] are the connection errors gone?
[13:21:57] it looks like they are, since 12:20
[13:22:14] 12:10, actually
[13:23:45] I am going to guess the 12:18 "Depool db1114 from API" had something to do with it
[13:27:43] Yeah
[13:27:48] I am going to do another test
[13:27:52] Pool it back as it used to be
[13:27:55] See if they come back
[13:27:58] And then depool it from main
[13:28:21] I wanted to give it 1 hour, so in 30 minutes I will pool it back
[13:28:23] there could be some missing indexes or something
[13:28:35] or some strange config
[13:28:42] I can check the schemas too yeah
[13:35:18] I don't know, I am just brainstorming
[13:43:15] DBA, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126856 (Marostegui) db1080 and db1114 have the same schema structure for revision table. However db1066 (the other API slave) has the incorrect indexes for the revision table: What it should have: ``` KEY `use...
[13:43:17] jynus: ^
[13:43:59] DBA, Epic, Patch-For-Review, codfw-rollout: Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover) - https://phabricator.wikimedia.org/T155099#4126865 (Marostegui)
[13:44:03] DBA, MediaWiki-API: Database query error (internal_api_error_DBQueryError) while getting list=allrevisions - https://phabricator.wikimedia.org/T123557#4126866 (Marostegui)
[13:44:15] DBA, Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#4126861 (Marostegui) Resolved>Open T191996#4126856 db1066 needs fixing for the indexes
[13:47:19] could that be not a fix, but a breakage?
[13:47:43] probably not
[13:47:50] we will see with the next test, I am updating the task
[13:47:58] We should see the same errors on db1080 I would say
[13:48:32] yah
[13:48:38] DBA, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126886 (Marostegui) So after depooling db1114 from API errors are gone, see: {F16934299} As a next test I am going to repool db1114 as API, to make sure errors are back. Once they are, I will then depool it from...
[13:53:37] aside from schema, I would do a diff of grants
[13:53:40] just in case
[14:02:12] DBA, Operations, ops-codfw, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4126969 (Papaul) p:Triage>Normal
[14:11:33] DBA, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127003 (Anomie) The different indexes shouldn't make a difference (although I can't promise they don't). InnoDB's clustering should be appending the primary key on db1114 so the "real" indexes are what's explicit...
[14:15:32] DBA, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127006 (Marostegui) >>! In T191996#4126886, @Marostegui wrote: > So after depooling db1114 from API errors are gone, see: > {F16934299} > > That graph is wrong it wasn't auto-refreshing. Errors kept going while...
[14:17:29] DBA, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127022 (Marostegui) db1080 uses the same (it has the same indexes as db1114). Reminder: db1066 and db1080 uses 10.0 and db1114 uses 10. ``` +------+-------------+---------------------+--------+-----------------...
[14:19:08] DBA, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127025 (Anomie) >>! In T191996#4127003, @Anomie wrote: > As in T186266, the same problem exists if we simplify the query to the single table `SELECT * FROM revision WHERE rev_page = '19131234' ORDER BY rev_times...
[14:21:42] DBA, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127027 (Marostegui) >>! In T191996#4127006, @Marostegui wrote: >>>! In T191996#4126886, @Marostegui wrote: >> So after depooling db1114 from API errors are gone, see: >> {F16934299} >> >> > That graph is wrong i...
[14:31:46] DBA, Operations, ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127041 (Papaul) @Marostegui is it okay for me to reboot the server?
[14:31:54] jynus: ^
[14:32:01] (as you know better its state)
[14:32:40] DBA, Operations, ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127043 (Marostegui) @Papaul let me double check with @jcrespo as he is/was working with esXXXX servers.
[14:34:36] DBA, Operations, ops-codfw: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127055 (jcrespo) Not now, I will have to depool it. Give me 5 minutes.
[14:35:01] I can take care of that, I just wanted to ask you before, as you know better what you were doing or planning :)
[14:35:14] Let me know if you want me to take over :)
[14:39:45] I want first to reproduce the issue
[14:46:34] DBA, Operations: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127091 (jcrespo) p:Normal>Low a:Papaul>jcrespo @Papaul @Marostegui Please don't do anything until it is clear what is the issue.
[14:52:27] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#4127133 (Bstorm)
[14:55:15] DBA, Operations: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127152 (jcrespo) Now that I have a way to test it, we can proceed, depooling: ``` $ ipmitool -I lanplus -H es2013.mgmt.codfw.wmnet -U root -E chassis power status Unable to read password from environment...
[15:06:14] DBA, DC-Ops, Operations, Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127179 (jcrespo) a:jcrespo>Papaul @Papaul you are now free to handle the server- it is up, but with all the service down and depooled. I would try the reset I propos...
[15:06:29] DBA, DC-Ops, Operations, Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127182 (jcrespo) p:Low>Normal
[15:07:21] DBA, DC-Ops, Operations, Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123236 (jcrespo) The reset a previous ticket suggested was T191977#4123270 (`racadm reset`)
[15:18:50] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, User-Ladsgroup, Wikidata-Ministry-Of-Magic: Apply schema changes to an isolated database and examine the results - https://phabricator.wikimedia.org/T191391#4127203 (jcrespo) I found T86530, which may be outdated, but may help with givin...
[15:20:18] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127209 (jcrespo)
[15:29:39] DBA, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127233 (Marostegui) >>! In T191996#4126367, @Marostegui wrote: > Yeah, not completely sure, as we also get lots of these on logstash: > ``` > Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1114 is no...
[15:55:15] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127273 (Papaul) 1- Power drain 2- Reset IDRAC 3- Update BIOS from 2.1.7 to 2.7.1 4- Update IDRAC from 2.21 to 2.52
[15:56:02] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127277 (Papaul) a:Papaul>jcrespo
[15:56:29] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127278 (Marostegui) It is still not working: ``` root@neodymium:/home/marostegui# ipmitool -I lanplus -H es2013.mgmt.codfw.wmnet -U root -E chassis power sta...
[16:08:59] T191977#4127278 :-(
[16:08:59] T191977: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977
[16:09:11] DBA, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4127339 (Marostegui) So, errors are back when the server is only on API. I am going to repool it normally for now. Tomorrow I will leave it pooled in API and will try to do a traffic capture during the burst of er...
[16:09:35] jynus: yeah, papaul asked me to reboot it one more time, so it is rebooting now
[16:09:51] jynus: does local ipmi work?
[16:09:58] if it does, maybe it is just the remote disabled
[16:10:14] what is local, the real machine?
[16:10:31] yes
[16:10:32] or the mgmt server?
[16:10:41] we will know on reboot :-)
[16:10:43] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127346 (Marostegui) After the reboot @papaul suggested, it still doesn't work :-(
[16:10:46] see the section Remote Lan Channel in https://phabricator.wikimedia.org/T150160#2951190
[16:11:07] if the diff shows some diff you can fix it changing --diff with --commit
[16:11:14] and redo the --diff to ensure there is no diff
[16:11:34] ok, thanks
[16:11:45] the interesting stuff is that this seems a new thing
[16:12:01] at least it is not on your list at the time, and I guess you tested all?
[16:12:28] volans: if that works, I will ask to double your pay
[16:13:08] at that time yes we fixed all of them in the end, and we have also some icinga checks
[16:13:16] oh that is a nice thing :)
[16:13:23] this wasn't caught on monitoring
[16:13:26] but they cannot check everything because management cards don't like to be pinged too much
[16:13:30] the mgmt system works
[16:13:34] at least this is the wisdom of the elders
[16:13:48] it is only the remote thing, will check locally now
[16:13:52] it seems to be working locally
[16:14:36] how did you test it, ipmitool is not installed!
[16:14:50] is it a new reimaged host?
[16:14:57] ipmi-chassis
[16:14:58] oh, there are other ipmi-related tools
[16:15:01] yeah
[16:15:19] but the hardware is new or is an old one?
[16:15:24] it is old
[16:15:31] I just reimaged it recently
[16:15:40] so it should have been working at some point in the past
[16:15:41] old but probably still in warranty
[16:15:49] volans: that is what I said
[16:15:56] it seemed to go wrong
[16:15:59] I'm wondering if re-imaging somehow makes this setting be reset
[16:16:08] volans: no
[16:16:13] I had to reimage it manually
[16:16:20] because I could not make the script reboot it
[16:16:24] so it wasn't that
[16:16:41] but some of these es2 hosts changed the motherboard
[16:16:50] maybe that reset the bios or something?
[16:17:15] maybe, dunno. We had the same thing on eqsin hosts but those are new
[16:17:27] so I guess it was missed in the first configuration
[16:17:38] marostegui: do I run the command?
[16:17:41] sure
[16:17:42] go ahead
[16:18:11] I asked papaul to join the channel as he can probably provide more info
[16:19:04] I ran: "ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff"
[16:19:15] but it said nothing and asked for no password
[16:19:31] and remote call still doesn't work
[16:19:37] if it shows no diff, it means that the remote calls are enabled already
[16:19:39] so it is not that
[16:19:47] :-(
[16:19:50] we have another long list of things to try though
[16:19:55] lol
[16:20:02] I may try them tomorrow
[16:20:04] one is that the root password might have been misaligned with the host's one
[16:20:15] it is on that ticket, right?
[16:20:37] on T150160 most of them yeah, but really we should find the time to document that wisdom
[16:20:38] T150160: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160
[16:20:57] e.ma also opened T191956 a few days ago ;)
[16:20:57] T191956: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956
[16:21:10] I will bookmark T150160 and try tomorrow
[16:21:22] today we already tried the reset and power drain
[16:22:02] this host will likely never get reimaged again
[16:22:23] and the management interface is still accessible, so not really urgent
[16:23:17] and I guess for those, I could try them on my own
[16:25:19] ack, the description of the task has most of the common issues/fixes
[16:25:37] there are some more hidden in related tasks, but I hope you can get away with one of the common ones
[16:39:09] jynus: fixed
[16:39:30] it was the password, I had tried myself the remote IPMI and noticed it failed right away, no timeout
[16:39:35] that usually indicates it's the password ;)
[16:46:37] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127531 (Volans) Open>Resolved I've fixed it, it was a case of password misalignment, see one of the cases described in T150160,
[20:33:39] DBA, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-notice, and 2 others: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960#4128206 (mmodell...
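Editorial note on the 16:39 diagnosis: the heuristic used there was that a remote IPMI call which fails immediately, with no timeout, usually points to a password mismatch rather than a disabled LAN channel or an unreachable management interface. A minimal sketch of that triage follows. The function name `classify_ipmi_failure` and the 5-second cutoff are assumptions made for illustration; neither comes from the log or from ipmitool.

```shell
#!/bin/sh
# Hypothetical triage helper. Arguments: exit status, elapsed seconds,
# optional cutoff in seconds (the default of 5 is an assumed value).
classify_ipmi_failure() {
    status=$1
    elapsed=$2
    cutoff=${3:-5}
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$elapsed" -lt "$cutoff" ]; then
        # Immediate rejection: the BMC answered, so suspect credentials.
        echo "check-password"
    else
        # Slow failure: the request likely timed out before any answer.
        echo "check-lan-channel-or-network"
    fi
}

# Example wiring (commented out so the sketch runs without ipmitool or a BMC):
#   start=$(date +%s)
#   ipmitool -I lanplus -H es2013.mgmt.codfw.wmnet -U root -E chassis power status
#   rc=$?
#   end=$(date +%s)
#   classify_ipmi_failure "$rc" "$((end - start))"

classify_ipmi_failure 1 0    # prints "check-password"
classify_ipmi_failure 1 30   # prints "check-lan-channel-or-network"
```

This only orders the checks; as the log notes, T150160 lists further causes that a timing heuristic cannot distinguish.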