[02:47:30] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [04:04:30] PROBLEM - MariaDB sustained replica lag on db1143 is CRITICAL: 14.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104 [04:08:00] PROBLEM - MariaDB sustained replica lag on db1141 is CRITICAL: 14 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1141&var-port=9104 [04:13:50] RECOVERY - MariaDB sustained replica lag on db1141 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1141&var-port=9104 [04:14:12] RECOVERY - MariaDB sustained replica lag on db1143 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1143&var-port=9104 [05:15:14] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): enwiki database replicas (Toolforge and Cloud VPS) are more than 24h+ lagged - https://phabricator.wikimedia.org/T262239 (10Marostegui) @Krinkle when toolforge wiki replicas service was built, it was assumed that lag could happen anytime or even all the... [05:39:17] jynus: for when you are around: I am transferring 9TB in codfw between es2014 and es2026, yesterday with around 1TB copied transfer.py showed: WARNING: Firewall's temporary rule could not be deleted [05:39:17] ERROR: Copy from es2014.codfw.wmnet:/srv/sqldata to es2026.codfw.wmnet:/srv/ failed, however the nc processes were still open and data was still being transferred [05:39:24] I killed both and started the process again [05:39:44] Right now, after around 8.1TB out of 9TB data copied, the same issue has happened again, and I can see the data still being transfered [05:40:21] I am wondering if I should kill it again, or leave it finish as the files are still flying around, should this be considered corrupted or is this "known" and the data is ok as the netcat processes are running fine? [05:40:47] If you think I should start the process again, I think I will run it from cumin2001 now, as both times when it failed it was run from cumin1001 [05:43:58] the warning is not an issue, the error is [05:44:29] jynus: what's your advise? should I kill nc processes on both hosts and start it again? [05:44:40] yes [05:44:51] ok, will do and run it from cumin2001 this time, thanks [05:44:58] if it errors out it means some check failed [05:45:17] these are the options I am running: transfer.py --no-encrypt --no-checksum [05:45:54] are the 2 copies from 2 sets of independent hosts? [05:46:10] What do you mean? [05:46:35] are there 4 hosts involved on the 2 copies, no overlapping between them? 
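A minimal sketch of the kind of spot-check being described above (whether the nc processes survived the reported error and the destination is still growing). The paths come from the conversation; the exact process match pattern is an assumption, since the log only says "nc processes":

    # On both source and target: are the copy processes still alive?
    pgrep -af '[n]c '        # pattern is a guess; adjust to what ps actually shows
    # On the target: is the destination still growing? Run twice, a minute apart.
    du -sh /srv/sqldata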
[05:46:52] no, both tries were from es2014 to es2026 [05:47:15] there could be some leftover stuff from the copies [05:47:21] check the /tmp files [05:47:34] it is clean [05:47:50] I can upload the latest version of transfer.py, this one didn't have all the features [05:47:57] but ERRORS are not normal [05:48:03] let me try again from cumin2001 [05:48:16] I did a transfer yesterday and got no error [05:48:19] and if that fails, we can see options, as that would have meant 3 failures from 3 attempts [05:48:48] ok, started again [05:48:48] share the full command line before running it [05:48:55] ok, share the full command please [05:49:01] I did above: transfer.py --no-encrypt --no-checksum es2014.codfw.wmnet:/srv/sqldata es2026.codfw.wmnet:/srv/ [05:49:14] you can also run it with --verbose to get what is going on [05:49:20] ok [05:49:20] so we know what failed [05:49:55] running, let's see what happens [05:49:58] thanks for the help [05:50:09] I will prepare the new package [05:50:16] this one was close, 8.1TB out of 9TB before it failed [05:50:27] the first one was 1.2TB or something [05:50:27] maybe there are some issues that are only fixed in the latest version [05:50:39] I will let you know how it goes, thanks [05:51:02] I don't know [05:51:13] maybe some network issues affected it yesterday? [05:51:38] but there is really no explanation for failing mid-transfer [05:51:45] unless the network failed [05:52:02] normally it fails at the start, if it cannot do something [05:52:51] or gives a warning at the end, for example, if the firewall hole is not there by the time it tries to reinstate it [05:55:05] Yes, I thought about the network issue yesterday too, so I was surprised to see it failing today in the morning too, we'll see this time, everything is within codfw now, cumin2001, and both hosts [05:58:55] yeah, it is weird it failed mid-transfer [05:59:18] I am uploading transferpy 1.0 to cumin2001 [05:59:33] we can try installing it, it has a few extra features and checks [06:01:29] if you are using it on cumin2001, maybe I can try it on cumin1001? [06:01:43] unless you have transfers ongoing there [06:02:24] marostegui ^ [06:03:03] you don't, only I do with backups, but I can live with it [06:03:19] yeah, also snapshots should be affected if there was something weird [06:03:38] so it is not like something new happened [06:03:58] I wonder if there could be some timeout or firewall issue if the transfer is too long? [06:04:26] honestly, this was never tested in production with 12 TB files [06:04:45] consider also skipping compression, for es hosts it may not have a lot of impact on transfer size [06:05:04] I will test transfer.py 1.0 on cumin1001 [06:06:49] I could see however a wan glitch over a multi-hour transfer [06:10:27] do you have an eqiad transfer to do so I could test it myself? [06:14:19] the new system also has a faster checksum algorithm, which you may be interested in enabling by default [07:06:26] jynus: you can use cumin1001 yes [07:08:14] so it failed on cumin2001 after 79G: https://phabricator.wikimedia.org/P12529 I guess I should delete all the pending iptables rules and try again? But why would it fail after 79G and not immediately? [07:08:59] yeah, that is why I say it would not make sense unless something external happened [07:11:48] do you want to try with the new package version?
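Before the retry discussed above, a hedged checklist for leftover state on both hosts. Port 4400 and the /tmp check come from the conversation; the process pattern and the exact rule text are assumptions:

    pgrep -af '[n]c '                        # any netcat left over from the failed run?
    ls -la /tmp | grep -i trnsfr || true     # leftover lock/temp dirs (newer transferpy versions)
    iptables -S INPUT | grep 4400 || true    # a stale ACCEPT hole from a previous attempt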
[07:12:46] I can install it right now [07:13:03] so a regular transfer worked for me [07:13:26] so maybe there is something special about the new es hosts, or some new issue with our env and large transfers [07:21:05] jynus: sure, install it on cumin2001 [07:21:06] and I can try it [07:21:13] I am going to clean up the iptables rules now [07:22:10] so iptables should not be an issue [07:22:13] ok [07:22:31] if they fail to be modified, the transfer literally cannot start [07:22:53] if they fail to be removed (e.g. because it has been reset) that is at the end, and it only gives a warning [07:23:09] if it changes mid-transfer it should not be an issue because the existing-connections exception kicks in [07:23:16] ok, so let's try the new package on cumin2001 [07:23:28] so I don't see why that would be an issue, I am really confused [07:24:14] so this is unpuppetized [07:24:34] but check /etc/transferpy/transferpy.conf [07:24:50] and tune it to your liking (see how I set it up on eqiad) [07:25:21] you probably want checksum = False and parallel_checksum = True [07:25:45] I will create a puppet patch when we are happy with this [07:26:05] the usage should be almost the same as now [07:26:40] run it with verbose just for debugging [07:27:03] or let me run it so I can check it at the same time, up to you [07:28:01] also at: https://doc.wikimedia.org/transferpy/master/usage.html [07:28:12] I will run it with https://phabricator.wikimedia.org/P12531 [07:28:27] looks ok to me [07:29:17] ok, running [07:29:18] (just to be clear, options work as usual, these are just defaults so you don't have to write them on the cmd line all the time) [07:29:31] we'll see how it goes [07:29:48] which hosts are used in this case? [07:29:55] es2014 -> es2026 [07:30:02] ok, will track them [07:30:11] ok, thanks [07:31:34] so far so good [07:31:56] Yes, the failures have been after: 79G, 1.2TB or something around that, and 8.1TB [07:31:59] one thing that you should be aware of [07:32:17] is that this version uses /tmp locks on the hosts [07:32:34] those of course are removed after an error or a finish [07:32:46] but if for some reason there is an unclean stop (e.g. kill) [07:32:58] you may want to remove it or it will detect a transfer is running [07:33:16] e.g. es2026:/tmp/trnsfr_es2026.codfw.wmnet_4400.lock/ [07:33:27] obviously, don't remove them now :-D [07:34:15] I can see the expected iptables hole: ACCEPT tcp -- es2014.codfw.wmnet anywhere tcp dpt:4400 [07:34:26] so normal behaviour so far [07:35:09] look at this other transfer I got [07:35:22] "2020-09-09 07:33:24 WARNING: Firewall's temporary rule could not be deleted" [07:35:26] that is a normal warning [07:35:41] but 2020-09-09 07:33:23 INFO: 1018999073676 bytes correctly transferred from dbprov1002.eqiad.wmnet to db1133.eqiad.wmnet [07:35:45] but no errors [07:36:05] and of course backups have been running today with the same version [07:36:15] so maybe there is something weird with the new hosts [07:36:51] is there any chance that puppet could be clearing the temporary firewall rule?
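For reference, an illustrative /etc/transferpy/transferpy.conf along the lines suggested above. Only checksum, parallel_checksum and verbose are named in the conversation; the section header and the exact file syntax are assumptions, and the real settings are in the paste (P12531) and the transferpy docs linked above:

    # /etc/transferpy/transferpy.conf -- illustrative sketch, not the deployed file
    [DEFAULT]
    # extra logging while debugging
    verbose = True
    # skip the slow post-copy checksum; rely on the faster in-line one instead
    checksum = False
    parallel_checksum = True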
[07:37:03] yes, but that is not an issue [07:37:24] as once the connection is up, the firewall has a rule to not cut ongoing connections [07:37:52] in $LASTJOB we ran conntrack -F after reloading iptables [07:37:58] mmm [07:38:20] I was going to suggest that the behaviour could have changed [07:38:30] but I got a 1h30m transfer with no issue [07:38:36] 2020-09-09 06:09:30 INFO: About to transfer /srv/backups/snapshots/latest/snapshot.s1.2020-09-08--19-00-01.tar.gz from dbprov1002.eqiad.wmnet to ['db1133.eqiad.wmnet']:['/srv/sqldata.deleteme'] (396131352535 bytes) [07:38:44] 2020-09-09 07:33:23 INFO: 1018999073676 bytes correctly transferred from dbprov1002.eqiad.wmnet to db1133.eqiad.wmnet [07:39:20] I see what you say, but wouldn't that show up on other transfers (e.g. backups)? [07:39:31] which take 1-2 hours [07:39:44] marostegui: how much time would that transfer take? [07:39:52] the dbprov hosts are stretch, right? [07:40:00] does marostegui's transfer involve buster hosts? [07:40:02] jynus: Quite a few hours, it is 9TB [07:40:10] kormat: 2 stretch 1 buster [07:40:19] kormat: stretch as source and buster as receiving host [07:40:36] marostegui: right. so if buster has somewhat different behaviour here, maaaybe that might explain it [07:40:45] (this is all just speculation) [07:40:58] kormat: I believe my transfer was exactly that, stretch -> buster [07:41:06] but yeah, the fact they are newer hosts [07:41:10] and larger transfers [07:41:20] may have something to do with it, with some unaccounted factor [07:41:39] I am just discarding regular/easy explanations [07:42:01] but the 1h30m transfer to db1133.eqiad.wmnet was to a buster host [07:42:10] ack [07:42:33] to be fair, I think we never tested in dev or in production with 10TB [07:42:37] because you know, time [07:42:53] so there could be a bug hiding there [07:43:05] something related to a new env [07:43:44] `Sep 9 07:29:24 es2026 ulogd[1109]: [fw-in-drop] IN=eno3 OUT= MAC=bc:97:e1:57:ba:98:14:18:77:53:ae:c5:08:00 SRC=10.192.0.141 DST=10.192.0.36 LEN=60 TOS=00 PREC=0x00 TTL=64 ID=32665 DF PROTO=TCP SPT=42702 DPT=4400 SEQ=1170154264 ACK=0 WINDOW=29200 SYN URGP=0 MARK=0 ` [07:44:13] that's es2026 dropping a packet from es2014 [07:44:17] yeah, I saw a new firewall rule [07:44:27] for burst [07:44:47] NFLOG all -- anywhere anywhere limit: avg 1/sec burst 5 nflog-prefix "[fw-in-drop]" [07:44:59] but that seems to be a log, not a drop? [07:46:07] jynus: the default policy on the chain is DROP, and without the transfer.py rule, NFLOG is the last entry in the chain [07:46:41] really the transfer.py rule should go before this one 🤷‍♀️ [07:47:50] I think that is new [07:49:56] looks like it was added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/490836 [07:50:56] it was created there but not applied yet?
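A hedged way to see the rule-ordering problem described above on the receiving host: the temporary ACCEPT for port 4400 should appear before the trailing NFLOG rule and the chain's DROP policy. The rule text is from the log; the grep patterns and the log file location are approximations:

    iptables -L INPUT -n --line-numbers | grep -E 'dpt:4400|NFLOG|ESTABLISHED'
    # and the packets actually hitting the log/drop path (file path assumed
    # from the ulogd line quoted above):
    grep fw-in-drop /var/log/syslog | tail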
[07:52:00] from what i can see it has been in use since 2019-07 or so [07:52:02] but when ACCEPT tcp dpt:4400 [07:52:05] is deleted [07:52:15] state RELATED,ESTABLISHED kicks in [07:52:34] otherwise transfer would not complete [07:53:30] jynus: if everything worked as we expected, we wouldn't be having this problem :) so i'm just thinking about things that could go unexpectedly and somehow cause this to happen [07:53:47] which is fair, I am just pointing flaws on that [07:54:40] if it is firewall related, it is easy to check- stop puppet and any cron [07:54:52] that could reload ferm [07:54:58] and check if the error happens then [07:55:55] let's see what happens, this has happened kinda randomly, 79G, 1TB and 8.1TB which means it has been running for many hours or some minutes before it failed [07:56:11] do you have the exact timestamps? [07:56:12] marostegui: it would be useful to know the start/end times of the next attempt [07:56:29] maybe we could check if puppet runs happened at the same time [07:56:50] jynus: no, I don't it removed the screen already, I can have them for the next run, but I am not sure if the failure gives a timestamp though [07:57:07] yeah, new version does- it uses logger with the time [07:57:08] No, it doesn't WARNING: Firewall's temporary rule could not be deleted [07:57:09] ERROR: Copy from es2014.codfw.wmnet:/srv/sqldata to es2026.codfw.wmnet:/srv/ failed [07:57:16] ok, so let's wait for it to fail [07:57:17] marostegui: do `date; transfer.py ....; date` [07:57:25] (or on the next run) [07:57:33] yeah, that is the old version, new version should have timestamps? [07:57:34] This is the first run with the new version, so we'll see [07:57:48] can you see if debugging has it at the start of the execution on current run? [07:57:55] jynus: yes, it does [07:58:02] so that should be enough [07:59:29] everything ok so far, right, do you have the start time? [07:59:43] 2020-09-09 07:29:17 INFO: About to transfer /srv/sqldata from es2014.codfw.wmnet to ['es2026.codfw.wmnet']:['/srv/'] (9366738014294 bytes) [08:00:02] Again, yesterday this failed after 1TB and then on the second run after 8TB, so that's lots of hours into the transfer [08:00:09] I have a theory [08:00:37] maybe the issue happens when ferm is reloaded, which may not happen on every puppet run [08:00:46] but only when network preferences change [08:01:05] killing ongoing connections or something [08:01:22] so it could be a race condition if someone deploys a change like that [08:01:34] jynus: we can check on puppet log if there was a ferm reload [08:01:59] there was a puppet run at 07:53:49 [08:02:25] do you remember the approximate time of the last error? [08:02:43] 7:53 is like 10 minutes ago, so it would have killed this transfer [08:02:49] oh, sorry [08:02:51] wrong host [08:03:06] mmm [08:03:09] I am very confused now [08:03:11] 07:58:24 [08:03:31] /Stage[main]/Mariadb::Config/File[/srv/sqldata]/owner) owner changed 'root' [08:03:32] WARNING: /tmp/trnsfr_es2026.codfw.wmnet_4400.lock temporary directory could not be deleted [08:03:32] 2020-09-09 07:59:49 WARNING: Firewall's temporary rule could not be deleted [08:03:32] 2020-09-09 07:59:49 INFO: Cleaning up.... 
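The `date; transfer.py ...; date` suggestion above, spelled out with the options already in use in this conversation, so a mid-transfer failure can be bracketed in time and matched against puppet/ferm activity:

    date
    transfer.py --no-encrypt --no-checksum --verbose \
        es2014.codfw.wmnet:/srv/sqldata es2026.codfw.wmnet:/srv/
    date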
[08:03:36] that could be too^ [08:03:49] changing the ownership by puppet to root [08:04:03] as well as: /Stage[main]/Ferm/Service[ferm]/ensure) ensure changed 'stopped' to 'running' [08:04:11] https://phabricator.wikimedia.org/P12529#70050 [08:04:17] (/Stage[main]/Ferm/Service[ferm]) Unscheduling refresh on Service[ferm] [08:04:40] packet_write_wait: Connection to 10.192.0.141 port 22: Broken pipe [08:04:54] that could be maybe someone cd ing into the dir [08:05:03] so it fails to be removed [08:05:32] but that is the clean up phase, the error must be before [08:05:40] yes, the error is that broken pipe [08:05:42] I just pasted [08:05:46] but there is no timestamp on that one [08:06:31] sorry, which paste, I don't see it? [08:06:42] but given the timestamps it is possible that Connection to 10.192.0.141 port 22: Broken pipe is from the ferm reload [08:06:51] jynus: [10:04:40] packet_write_wait: Connection to 10.192.0.141 port 22: Broken pipe [08:07:08] and after that this happened: https://phabricator.wikimedia.org/P12529#70050 [08:07:31] yeah, that last part is not relevant, it is just cleaning up after error [08:07:54] it is the Broken pipe that broke the transfer [08:07:54] It is possible that Sep 9 07:58:24 es2026 puppet-agent[21980]: (/Stage[main]/Ferm/Service[ferm]) Unscheduling refresh on Service[ferm] triggered the "Connection to 10.192.0.141 port 22: Broken pipe" [08:08:02] There is no timestamp on "Connection to 10.192.0.141 port 22: Broken pipe" but it could be the case [08:08:03] indeed [08:08:11] to be 100% sure [08:08:16] I am going to stop puppet [08:08:17] and try again [08:08:20] exactly [08:08:22] one last thing [08:08:40] you will have to manually remove the tmp subdir [08:08:50] as it couldn't be removed by the cleanup process [08:08:56] mmm, there is none [08:09:03] interesting [08:09:08] https://phabricator.wikimedia.org/P12532 [08:09:08] maybe it is a safeguard [08:09:30] yeah, maybe that is an exception handler that tries to cleanup everything [08:09:55] so error of the error handler, not a big deal, minor ui bug [08:09:58] should I stop puppet on the source too? [08:10:02] just to be sure? [08:10:11] nah, I bet is target [08:10:20] ok, starting the transfer again then [08:10:32] so question [08:10:52] last time, it must have run for a long time to transfer 8TB [08:11:05] more than the 20 minutes between puppet runs? [08:11:20] maybe there were no puppet runs that reloaded ferm during those hours [08:11:34] could this be a newly feature recently deploy to puppet? [08:11:46] and why it doesn't affect other transfers? [08:12:52] kormat: I agree that the position of the firewall hole is wrong [08:12:59] but I really think that is relatively new [08:13:45] jynus: worth a task to discuss there? [08:14:00] let's see how the last transfer goes [08:14:05] ok [08:14:08] in fact, let me run puppet manually [08:14:12] is it ongoing? 
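A small sketch of the two manual steps being taken here to rule out the ferm-reload theory: keep puppet from touching the target during the next test, and clear the stale transferpy lock if a previous run died uncleanly. The lock path is from the log; the disable reason string is illustrative:

    # on es2026
    puppet agent --disable 'debugging transfer.py mid-transfer failures'
    # only if no transfer is currently running against this host:
    rm -rf /tmp/trnsfr_es2026.codfw.wmnet_4400.lock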
[08:14:15] yes [08:14:34] actually, I cannot run it because it is disabled, my fault [08:14:38] let's see how it goes [08:14:43] and I can reproduce it on another host [08:14:50] we can enable it [08:14:54] nah [08:14:55] let it go [08:14:56] it has only done 15G [08:14:59] you have one task [08:15:13] which is to populate the hosts [08:15:16] let me handle the error [08:15:21] ok [08:15:30] I can now test on the new test host :-D [08:18:50] so this is the thing, I just ran a file transfer from dbprov1002.eqiad.wmnet to db1133 and ran puppet on both places [08:18:59] iptables reloaded, as expected [08:19:42] but as the connection is established it "works" (even if the position should be better to avoid logging) [08:19:51] so maybe it is not iptables, but chown? [08:20:07] jynus: enable puppet on es2026 and try there, it is ok, it is just 35G [08:20:08] it was the other thing that happened on the puppet run [08:20:18] 1 second [08:20:59] owner changed 'root' to 'mysql' [08:21:09] group changed 'root' to 'mysql' [08:21:15] mode changed '0700' to '0755' [08:21:33] could that have caused the write to fail, rather than the network? [08:21:59] jynus: that would have happened on the first puppet run after the transfer started, I would guess [08:22:06] jynus: try running puppet on es2026 [08:23:03] Notice: /Stage[main]/Mariadb::Config/File[/srv/sqldata]/owner: owner changed 'root' to 'mysql' [08:23:04] Notice: /Stage[main]/Mariadb::Config/File[/srv/sqldata]/group: group changed 'root' to 'mysql' [08:23:06] Notice: /Stage[main]/Mariadb::Config/File[/srv/sqldata]/mode: mode changed '0700' to '0755' [08:23:07] Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' [08:23:16] 08:23:07 after finishing [08:24:06] to test whether it is chmod [08:24:18] transfer to a different dir [08:24:26] (unless it didn't fail now) [08:25:00] nah, it is still transferring [08:26:29] btw, there are 2 ongoing transfers, one on port 4400 and another on port 4401 [08:26:57] I have no idea what that is [08:27:01] I am already lost with all the stuff [08:27:22] you have two ongoing copy processes [08:27:36] Last time I checked there were no nc processes before I started the last transfer [08:28:01] https://phabricator.wikimedia.org/P12533 [08:28:03] nor tmp locks or anything, so I don't know [08:28:26] one started at 07:29 [08:28:30] jynus: I know, but what I am saying is that when I started the new transfer there were no nc processes on either the target or the source [08:28:33] and another at 08:11 [08:28:35] So I have no idea how that came alive [08:28:47] let me kill them [08:30:40] do you want me to try? [08:30:51] sure [08:30:52] do you have a screen you use for this? [08:31:00] yes, on cumin2001 screen -x transfer [08:31:15] and puppet is stopped on es2026 [08:32:19] I am going to copy to another dir [08:32:22] just to be sure [08:32:33] ok [08:32:52] root@es2026:/srv$ mkdir transfer [08:34:24] so far so good [08:34:28] now let's wait [08:34:59] maybe the logging gets overloaded and makes things fail [08:38:49] so there could be a bug somewhere, my main issue is that I cannot reproduce it elsewhere [08:41:57] I will also try moving another es backup to a buster host [08:43:58] As I said earlier, I think it is easier if we track all this on a task, as we are trying many different things and it is easy to get lost in the investigation paths on irc logs, as there are many [08:44:07] +1 [08:47:14] ok to me [09:35:06] Well this was a long read in the morning. Is it continuing in a task?
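A hedged way to spot the kind of overlapping copies found above (two listeners, ports 4400 and 4401) before starting a new run; the port range and process name come from the log, the exact commands are just one way to do it:

    ss -tlnp | grep -E ':440[0-9] '          # transferpy listeners and the pids behind them
    ps -eo pid,lstart,args | grep '[n]c'     # start time of each nc, to match against the graphs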
[09:36:07] I think manuel was going to create one [09:36:12] Was I? [09:36:17] oh [09:36:29] I thought you were going to create it as you suggested we move there [09:36:29] Ok, let me create it [09:37:18] I will add my ongoing debugging, ofc [09:40:09] 10DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (10Marostegui) [09:40:22] 10DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (10Marostegui) p:05Triage→03Medium [09:40:36] I have created that ^ please expand with your thoughts/tests/theory [09:40:51] genuine question [09:40:59] the ", however the nc processes were still open and data was still being transferred" [09:41:08] was part of the error or your comment? [09:41:17] I am guessing your comment, right? [09:41:39] I can add that to the task [09:41:45] it is there [09:41:52] I mean if it was just a formatting mistake [09:41:59] I am 99% sure it is [09:42:08] 10DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (10Marostegui) [09:42:32] let me show you, I may not be explaining good enough [09:43:50] 10DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (10jcrespo) [09:44:04] let me know if my last edit is correct [09:44:14] I will add my ongoing debugging [09:44:37] that last part is very interesting [09:45:11] jynus: The diff is weird, it looks like you just moved my sentence below? [09:45:26] but yes, essentially that [09:45:29] yep, outside of the error log, is that correct? [09:45:33] I think it's to clarify that this was not part of the error [09:45:34] yep [09:45:51] I wouldn't discard having that as error, honestly [09:46:16] I am commenting now [09:46:21] cool thanks [09:46:42] 10DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (10Marostegui) [09:50:19] 10DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (10LSobanski) Despite having just read the conversation history, I already forgot, did we try a smaller transfer between these particular hosts to see if it completes? Also, when was the last time a transfer b... [09:53:48] 10DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (10jcrespo) The error seems to be es2026-only (so far), as similar transfers (e.g. all snapshots backups) happened without errors. However there are several things that could affect only this transfer: * This... [09:54:23] ^I just answered you sobanski withough knowing [09:55:12] my bet after manuel clarifications is that network issues yesterday maybe left bad state on source or target, and that interacts by having multiple runs concurrently [09:55:49] jynus: but today we had issues when starting a new transfer too [09:55:57] yes [09:56:04] I am betting more on puppet [09:56:31] but that wouldn't fit with my 6TB transfer ongoing for 1h? [09:57:03] Yesterday the transfer went thru the network issues with no problem, I checked before leaving for the day as I was worried that the networking issue could have messed it up [09:57:05] I think it is es2026 specific, but have no proof of what exactly [09:57:12] and it didn't error...and today in the morning I found the error [09:57:25] yeah, but I found several processes ongoing there [09:57:32] but with today's date [09:57:36] true [09:57:55] do you have another es host to setup? 
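One of the open questions in the task comments above is whether a small copy between the same pair of hosts succeeds. A low-risk probe could look like the following; the source file name is hypothetical (any small file on es2014 would do) and the destination is the separate test directory created earlier:

    transfer.py --no-encrypt --no-checksum --verbose \
        es2014.codfw.wmnet:/srv/sqldata/test.sample es2026.codfw.wmnet:/srv/transfer/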
[09:58:00] not yet [09:58:01] I am guessing no due to redundancy [09:58:04] yeah [09:58:08] * sobanski away (meeting) [09:58:09] I think the key will be that [09:58:37] Check this https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=8&orgId=1&refresh=5m&var-server=es2026&var-datasource=thanos&var-cluster=misc&from=now-24h&to=now [09:58:58] So 06:00 is when I killed the existing nc processes that were still running from the previous day [09:59:14] So even if transferpy errored, the data kept being transferred and hence it had 8TB copied [10:00:04] I don't get that last part [10:00:17] Ok, let me clarify [10:00:18] rx stopped at 6am? [10:00:28] as I can see [10:00:34] jynus: let me explain [10:00:36] or you mean it errored out before? [10:00:54] jynus: let me write out the whole workflow [10:01:35] Yesterday in the morning I started the transfer [10:01:43] Then network issues arose and I was worried they would have killed it [10:01:55] I checked during and after the network issues and the transfer was still running [10:02:04] I left for the day and checked again, the transfer was still running [10:03:02] Today in the morning I came and checked, and transferpy showed the error, but the nc processes were still present on both hosts and sending the file (I did an strace and I also confirmed by checking /srv/sqldata on the target and it was growing), I double checked with you on IRC and after your confirmation that the data wasn't really reliable, I killed both nc processes and you can see that drop around 06:00 [10:03:15] The total data transferred during those hours was around 8.1TB out of 9TB [10:03:26] Does this help to get the whole picture of what was done and what I found in the morning? [10:03:31] I see, I can complete the picture [10:03:52] at 8:28 [10:03:59] I saw there were 2 nc processes [10:04:12] that could have been after we started the tests after our first irc chat [10:04:20] one starting at 07:29 [10:04:35] and another at 08:11 [10:04:45] that matches the graph [10:05:02] but that wouldn't explain your second error [10:05:02] I have no idea how that can happen, as before starting any transfer I always look for nc processes to avoid that [10:05:12] that is ok, not a big deal right now [10:05:15] the second error was around 7:59 I think, which kinda matches the puppet run [10:06:03] I give you that, the issue is that it really doesn't fit with puppet not killing transfers elsewhere, so I am looking for something special here [10:06:20] (I don't know what) [10:06:35] I don't know either, I am just stating what I have seen/done [10:06:44] I understood you now [10:10:51] jynus: did you enable puppet on es2026? [10:11:21] I did on a previous run [10:11:30] cool, so that's good, as nothing has broken yet [10:11:41] haven't touched it since the last run, not sure if you re-disabled it [10:12:29] no, no [10:13:06] 10DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (10Marostegui) More food for thought: One of the errors was pretty close and sort of matched: ` Sep 9 07:58:20 es2026 puppet-agent[21980]: (/Stage[main]/Mariadb::Config/File[/srv/sqldata]/owner) owner change... [10:13:18] another option I could think of is that the chmod only breaks it when it happens to be writing to the root dir [10:13:31] and that is more likely on es? [10:13:46] on root dir? [10:13:54] some weird race condition like that, undetected for years [10:13:55] on /root?
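Given the timeline reconstructed above (second error around 07:59, puppet run at 07:58), a hedged way to line the two up on the receiving host. The syslog path is assumed from the puppet-agent lines quoted earlier; adjust the time window to the actual failure:

    grep -E 'puppet-agent.*(Ferm|sqldata)' /var/log/syslog | tail -20
    journalctl -u ferm --since '2020-09-09 07:50' --until '2020-09-09 08:05'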
[10:14:13] root of transfer, /srv/sqldata previously [10:14:47] but nothing should be writing to it apart from transfer itself [10:15:17] yeah, but not sure how atomic is the chmod and the chown [10:15:42] we can test that now with the ongoing transfer if needed [10:16:02] maybe there is a second in which writes are disabled or something [10:16:22] feel free to test with es2026 and the current transfer [10:16:23] don't know, I am just guessing [10:16:28] change them and run puppet [10:16:35] it is being tested in a way [10:16:43] by transfering to a separate dir [10:16:53] ok [10:16:57] if it works, I would suspect chown rather than iptables [10:17:24] I was going to say that I would be highly surprised it is that [10:17:32] but in reality, we don't really do much clonings [10:17:36] *any [10:17:38] *many [10:17:51] maybe the first in years? [10:18:15] as --type=decompress works slightly differently [10:18:47] plus for obvious reasons tests in production don't usually use /srv/sqldata [10:20:11] for tracking, this is also the dbprov test transfer: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=8&orgId=1&refresh=5m&from=now-24h&to=now&var-server=dbprov1003&var-datasource=thanos&var-cluster=mysql [10:20:22] slightly faster becaue uses 10G [10:21:36] meanwhile I can work on (installing the temporary iptables rule second to last?) [10:21:55] that will avoid false positives on drop logs [10:23:45] note kormat that 2019 is "new" to me :-D [11:01:17] Going to stop s2 eqiad master in preparation for the PDU maintenance [11:02:21] oh, I forgot [11:02:27] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10Marostegui) Is there anything pending here that might require power changes? [11:02:30] hopefully it goes smoother this time [11:04:22] wondering if I should stop a backup source, too [11:05:00] which one is affected? [11:05:19] ah, no, it is a redundant one [11:05:23] will keep it up [11:05:27] db1140 [11:05:32] * sobanski back [11:05:46] jynus: I don't see that one on: https://phabricator.wikimedia.org/T261453#6417370 [11:05:57] oh, I got the wrong list [11:06:07] that was yesterday [11:07:26] I don't see anything else too worrying [13:05:38] 10DBA, 10MediaWiki-General, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (10Marostegui) [13:06:36] starting s2 master, per the email from chris, the pdu work needs to be postponed [13:06:40] s2 eqiad master, that is [13:11:18] I didn't get an email but I confirmed with Chris that he's pausing for the day [13:16:27] transfer and test transfer seem to be still ongoing normally [13:16:51] 2TB transfered [13:55:14] I am going to restart mysql on tendril [13:55:20] jynus: kormat ^ is that ok? 
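To probe the chown/chmod race theory discussed above, a minimal sketch that records the destination directories' owner and mode while the copy to the separate dir runs; the paths are from the log and the interval is arbitrary:

    while true; do
        date '+%F %T'
        # owner:group, octal mode, path for both the old and the test destination
        stat -c '%U:%G %a %n' /srv/sqldata /srv/transfer 2>/dev/null
        sleep 30
    done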
[13:55:33] yep, no issue [13:55:34] go for it [13:55:38] one thing [13:55:39] ok thanks [13:55:43] check memory before and after [13:55:57] actually not need to "check" it will be on prometheus :-D [13:56:03] I am restarting because of memory, there's a warning for memory usage, so I want to do it now before it crashes [13:56:17] oh, I thought it was just the microcode [13:56:30] didn't see the warning [13:56:39] I guess it is the possible leak with the events [13:56:41] but I had seen it on grafana [13:56:44] So I am going to go for it [13:56:44] :-( [14:03:50] tendril is back [14:12:24] 10DBA, 10MediaWiki-General, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (10Marostegui) [14:12:41] 10DBA, 10MediaWiki-General, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (10Marostegui) 05Open→03Resolved All done [14:18:32] * sobanski stepping away [14:50:43] 10DBA, 10Google-Summer-of-Code (2020), 10Patch-For-Review: GSoC 2020 Proposal: Improve the framework to transfer files over the LAN - https://phabricator.wikimedia.org/T248256 (10jcrespo) [14:51:00] 10DBA, 10Patch-For-Review: transferpy 1.0 release - https://phabricator.wikimedia.org/T257601 (10jcrespo) 05Open→03Resolved Latest (1.0) transferpy available at https://apt.wikimedia.org/wikimedia/pool/main/t/transferpy/transferpy_1.0_amd64.deb and deployed on all root clients hosts at WMF. [14:53:52] 10DBA, 10Google-Summer-of-Code (2020), 10Patch-For-Review: GSoC 2020 Proposal: Improve the framework to transfer files over the LAN - https://phabricator.wikimedia.org/T248256 (10jcrespo) 05Open→03Resolved Final report from both student and mentor was released, successfully: https://medium.com/@ajupazham... [14:56:19] 10DBA: transferpy --checksum wrongly output `checksums do not match` message - https://phabricator.wikimedia.org/T256755 (10jcrespo) p:05Triage→03Low This is low priority, but a valid bug- since parallel_checksum will most mostly used from now on. [14:57:52] I have deleted the GSOC2020 column from the DBA workboard (I also emptied it) [14:58:22] 10DBA: Show transfer time once successfully completed - https://phabricator.wikimedia.org/T258559 (10jcrespo) a:05Privacybatm→03None [14:58:48] 10DBA, 10Patch-For-Review: transferpy: Multiprocess the transfers - https://phabricator.wikimedia.org/T259327 (10jcrespo) a:05Privacybatm→03None [15:05:06] 10DBA, 10Google-Summer-of-Code (2020), 10Patch-For-Review: GSoC 2020 Proposal: Improve the framework to transfer files over the LAN - https://phabricator.wikimedia.org/T248256 (10Marostegui) Congratulations to you both! [15:13:18] 10DBA, 10Epic, 10Patch-For-Review: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (10jcrespo) [15:28:56] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [15:33:01] 10DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (10jcrespo) I made a 6TB copy to discard it was the large sizes (not tested a lot for practical reasons) and it worked ok: ` root@cumin1001:~$ transfer.py backup1002.eqiad.wmnet:/srv/backups/dumps/latest/dump.... 
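A small sketch of the before/after memory check mentioned for the tendril restart above; the service name mariadb is an assumption for that host, and the same numbers are of course visible in the Prometheus/Grafana host dashboards:

    free -m                      # note used/available before
    systemctl restart mariadb    # service name assumed
    free -m                      # and after; compare with the Grafana host-overview panel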
[15:36:00] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [15:46:04] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [15:48:36] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): enwiki database replicas (Toolforge and Cloud VPS) are more than 24h+ lagged - https://phabricator.wikimedia.org/T262239 (10JJMC89) [16:39:20] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) I upgrade the the Network firmware on the host [16:39:53] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) Thank you Papaul. I have upgraded mariadb and will start repooling the host tomorrow [16:49:22] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [17:04:46] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [17:08:35] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10wiki_willy) Update: Due to the accident/injury at the data center today, @Jclark-ctr will try and complete the upgrade of these 2x PDUs tomorrow (on Thur, Sept 10... [17:24:27] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): enwiki database replicas (Toolforge and Cloud VPS) are more than 24h+ lagged - https://phabricator.wikimedia.org/T262239 (10bd808) p:05Unbreak!→03High Lowering from UBN! to high. This is a real issue, but setting UBN! on a best effort support servic... [17:48:59] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10RobH) [17:49:22] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10RobH) [18:11:50] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [18:13:08] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) [18:15:56] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] member ge-3/0/9 { ... } + member ge-3/0/10; [edit interfaces interface-r... 
[18:16:13] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) [18:41:34] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) [19:24:27] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2141.codfw.wmnet... [19:31:57] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) ` │ │ reuse-parts: Recipe device matching failed │... [19:51:09] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) @Marostegui problem with partman recipe. Can you please check. Thanks [20:56:55] 10DBA, 10Data-Services, 10Platform Team Initiatives (API Gateway), 10cloud-services-team (Kanban): Prepare and check storage layer for api.wikimedia.org - https://phabricator.wikimedia.org/T246946 (10nskaggs) 05Open→03Resolved It seems some of the views were already created, but I double checked things... [21:04:30] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for jawikivoyage - https://phabricator.wikimedia.org/T260482 (10nskaggs) 05Open→03Resolved a:03nskaggs Views created. [22:15:48] 10DBA, 10MediaWiki-extensions-FlaggedRevs, 10User-DannyS712: flaggedrevs_promote has entries for non existent users - https://phabricator.wikimedia.org/T262486 (10DannyS712)