
PQ AccessDeniedException on Windows in checkpoint move #12345

Open

colinsurprenant opened this issue Oct 14, 2020 · 20 comments · Fixed by #13902

Comments

@colinsurprenant
Contributor

The PQ checkpoint strategy has been modified to write to a temporary file and then atomically move it to the final checkpoint file name.

On Windows, under specific but not yet understood circumstances, the move operation throws an AccessDeniedException. The problem seems to happen when using layered filesystems, such as with NAS and/or when running in containers.

#10234 introduced the queue.checkpoint.retry setting to mitigate the problem by catching that exception and retrying the operation.

We still have two problems with this:

  • Unfortunately the root cause is not yet understood.

  • The mitigation sometimes does not work.

On the latter point, I think we should revisit the queue.checkpoint.retry strategy, which performs a single retry after a 500ms sleep.

Until we figure out the root cause, I think we should improve that strategy by having multiple (configurable?) retries using a backoff policy (exponential?): start with a much lower sleep and increase it until we reach a retry limit.
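
A minimal sketch of what such a retry-with-backoff around the checkpoint move could look like (illustrative only, not the actual Logstash implementation; the class, method, and parameter names here are hypothetical):

import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CheckpointMoveRetry {
    // Retry the atomic checkpoint move with exponential backoff instead of
    // a single retry after a fixed 500ms sleep.
    static void moveWithBackoff(Path tmp, Path target, int maxRetries, long initialSleepMs)
            throws IOException, InterruptedException {
        long sleepMs = initialSleepMs;
        for (int attempt = 0; ; attempt++) {
            try {
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
                return; // success
            } catch (AccessDeniedException e) {
                if (attempt >= maxRetries) {
                    throw e; // give up once the retry limit is reached
                }
                Thread.sleep(sleepMs);
                sleepMs *= 2; // exponential backoff
            }
        }
    }
}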

@jugsofbeer

jugsofbeer commented Oct 28, 2020

I have this exact issue recurring randomly in production while running Logstash v7.9.3 with queue.checkpoint.retry: true.

In my case I've seen a 30-second delay on some temporary files before.

  • usual suspects: anti-virus exclusion policies not 100% staying away from the files?
  • virtual-server-supplied disks might be causing subtle sync issues, randomly 1-3 seconds?
  • a configurable number of retries would help
  • a configurable time delay value would help
  • a configurable (optional) catch-all behavior: if retries fail, terminate the pipeline and restart it with the persisted queue enabled
  • a configurable (optional) catch-all behavior: if retries fail, terminate the pipeline and restart it with the memory queue enabled (even if persisted was specified) ... we want the pipeline running!

@colinsurprenant
Contributor Author

@jugsofbeer Thanks for the feedback.

usual suspects: anti-virus exclusion policies not 100% staying away from the files?

Could be, but AFAIK in the cases I saw reported, a suspected anti-virus was disabled and the problem persisted.

virtual-server-supplied disks might be causing subtle sync issues, randomly 1-3 seconds?

My hypothesis so far is that we are dealing with some Windows filesystem layering (for virtual servers and/or NAS) that causes a race condition, but this is unconfirmed.


@kaisecheng kaisecheng self-assigned this Feb 9, 2022
kaisecheng added a commit that referenced this issue Mar 24, 2022
This commit changes `queue.checkpoint.retry` to `true` by default, allowing retry of checkpoint write failures.
Add exponential backoff retry to checkpoint writes to mitigate AccessDeniedException on Windows.
Fixed: #12345
The same commit (1a5030b) was subsequently cherry-picked to release branches by kaisecheng and the github-actions bot on Mar 24, 2022.
@sharonhac

Hi, we have updated to Logstash 7.17.3 and are still getting the "Error writing checkpoint: java.nio.file.AccessDeniedException" error.
Is there some flag / property that we need to add?
We're running it on Windows Server.
Thanks

@kaisecheng
Contributor

No. Logstash mitigates the issue by retrying with backoff by default. The retry is not infinite. This issue is caused by an anti-virus or even a file explorer locking the file. Retrying indefinitely or a configurable number of retries is not under consideration, as it hides the underlying problem and places too much configuration burden on users.

I am considering a higher number of retries that finishes in around 3 seconds. Another option we could think of is to restart the pipeline instead of shutting it down.

@kaisecheng kaisecheng reopened this May 1, 2022
@sharonhac

Hi, do you have an estimate for a new fix? Will it be included in a new release?
Thanks

@kaisecheng
Contributor

kaisecheng commented May 8, 2022

The plan is to retry for about 3 seconds in the next release. Please note that if any application locks the file for that whole period, the same exception is still thrown. One way to get rid of it is to set anti-virus exclusion rules on the relevant folders.
The checkpoint files live in the data folder, which is $LOGSTASH_HOME/data by default; see https://www.elastic.co/guide/en/logstash/current/dir-layout.html for the path.

Edit: We need more feedback or telemetry to support the decision of a 3-second retry rather than blindly making it longer.
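
For a rough sense of scale (the actual retry parameters Logstash uses are not stated here; the numbers below are an assumption purely for illustration), a doubling backoff starting at 100 ms adds up to roughly 3 seconds after five retries:

public class BackoffBudget {
    public static void main(String[] args) {
        // Hypothetical parameters: 5 retries, doubling from an initial 100 ms sleep.
        long sleepMs = 100;
        long totalMs = 0;
        for (int i = 0; i < 5; i++) {
            totalMs += sleepMs;
            sleepMs *= 2;
        }
        // 100 + 200 + 400 + 800 + 1600 = 3100 ms, i.e. roughly 3 seconds of retrying
        System.out.println("total backoff: " + totalMs + " ms");
    }
}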

@sharonhac

We have disabled the anti-virus and the issue still reproduces.
By "next release", do you mean the next Logstash 7.x?
The issue affects our production environment and reproduces for various pipelines after a fresh reset; the data is not usable because it is not streamed correctly ...
We'd like to get the suggested fix into the latest 7.x version if possible ...
Can you tell whether changing the persistent queue type from disk to memory would make a difference?
Thanks

@kaisecheng
Contributor

kaisecheng commented May 9, 2022

The exception cannot be solved in Logstash alone. One suggestion is to find out which process in Windows locks the file. We don't have a timetable for further investigation. It would be great if you could provide a minimal pipeline and environment information to reproduce the problem for future investigation.

That said, if there is any improvement this year, it will be in 7.x. I will keep this issue up to date.

Using queue.type: memory can avoid this issue as it won't touch the disk, but it gives up the at-least-once delivery guarantee. You may consider queue.drain: true to drain the events in the memory queue before shutdown. Edit: by default, Logstash will try to drain the memory queue before shutdown.

@reuvenik

reuvenik commented May 9, 2022

Hi.
We have the same issue. Running the debug on one pipeline only and sending diagnostics did not give support any clue about the root cause.
We ran the Handle application in a loop to catch locked objects while Logstash was running. Unfortunately, on 3 occasions Handle ran in the same second as the access-denied error appeared, and nothing in the Handle results appeared to lock any file.
So we have no clue how to find out which process is locking these objects. When we look at the OpenedFilesView application, the only thing we can see accessing the data folder in Logstash is java.exe, so again we are at a dead end in finding what program could cause this access-denied issue.

Any suggestions on how to check deeper?

What did you mean by "it gives up the at-least-once delivery guarantee"?
Another question... I looked at the documentation and it seems like the drain option is only for the persistent queue and not for the memory queue... are you sure it can be used with the memory queue to avoid data loss?

Thanks.

@kaisecheng
Contributor

@reuvenik Thanks for pointing out the drain option! For the memory queue, the drain mechanism is enabled by default; Logstash tries to drain the events before shutdown. As events are stored in memory, data loss happens if the host crashes, hence there is no delivery guarantee.

I am far from an expert on the Windows platform and have no new ideas in mind. So far we have confirmed that anti-virus software and File Explorer can lock the file. I am curious whether you can see Handle showing File Explorer when it triggers the exception. If you believe there is no process locking the file, you can share your pipeline, config, and environment setup here to help us reproduce the issue.

@reuvenik

Hi @kaisecheng
If stopping the service will not cause data loss because of the drain option, that is a great relief. So we will lose data only if the server crashes, correct?

Before I ran Handle in a loop, I tested having Explorer open in the data folder, and Handle captured it. When I ran my later test, I again had Explorer open in the data folder and it was captured.
But on all 3 occasions when the access-denied error appeared, Handle ran in the same second and captured nothing.
This, and the fact that OpenedFilesView shows only the java.exe program locking the queue folder, makes me suspect that Logstash is locking itself. We have 3 Logstash servers, but maybe because they listen to the same pipelines (let's say the same Kafka topic), the synchronization between the servers is what is causing the lock. Or maybe this input can lead you to the correct conclusion.

In any case, sending the pipeline configuration is not relevant because it happened in what seems to be a random pipeline each time. And it happened for pipelines with different source technologies (Kafka, Beats, Oracle, Event Hub, JMS...). So I guess the problem is not in the pipeline. It even happened for a pipeline that is fed from another pipeline and sets the index name in the Elasticsearch output.

@kaisecheng
Contributor

@reuvenik Thanks for the input.

So we will lose data only if the server crashes, correct?

This page highlights the key points. In addition to the listed limitations, keep in mind that network connectivity issues or plugin bugs can happen, and an admin might want to restart the process. In such cases, without a persistent queue, the events in memory are lost.

@prakashguru

@kaisecheng we are also encountering this issue, but the strange part is that it happens in only one environment, and even there it is random.

Checking past bugs, I found there were two tests removed (ignored) citing failures on Windows. Surprisingly the exception was "java.lang.IllegalStateException: java.nio.file.AccessDeniedException: C:\Users\jenkins.LOGSTASH-CI-WIN\AppData\Local\Temp\junit1468481081486511579\data\checkpoint.head.tmp -> C:\Users\jenkins.LOGSTASH-CI-WIN\AppData\Local\Temp\junit1468481081486511579\data\checkpoint.head"

#9918

By any chance, could a race condition lead to this type of error in PQ checkpoints?

@prakashguru

I tried to investigate the possible causes of the AccessDeniedException on file move, and found that the following code throws AccessDeniedException.

import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveWithOpenHandle {
    public static void main(String[] args) {
        try {
            var sourceFilename = Path.of("c:\\temp\\source.txt");
            var targetFilename = Path.of("c:\\temp\\target.txt");

            if (Files.exists(sourceFilename))
                Files.delete(sourceFilename);

            if (Files.exists(targetFilename))
                Files.delete(targetFilename);

            // Create the source and target files.
            Files.writeString(sourceFilename, "source file");
            Files.writeString(targetFilename, "target file");

            // Open a byte channel for the target file but don't close it.
            SeekableByteChannel sbc = Files.newByteChannel(targetFilename);

            // Try to move the source file onto the still-open target.
            // On Windows this line throws AccessDeniedException.
            Files.move(sourceFilename, targetFilename, StandardCopyOption.ATOMIC_MOVE);
        } catch (Exception exception) {
            System.out.println(exception.toString());
        }
    }
}

Here I try to move a file onto a target for which a handle is already open (but not closed), simulating a case where a checkpoint head write() happens while a checkpoint read is in progress.

Is this a valid scenario for PQ checkpoint reads and writes?
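
For comparison, here is a variant of the snippet above in which the handle on the target is released (via try-with-resources) before the move. Under the same assumptions this move is expected to succeed, which supports the idea that any process holding an open handle on checkpoint.head at the moment of the move can trigger the exception. The class name is mine; the paths are the same as in the snippet above.

import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveAfterHandleClosed {
    public static void main(String[] args) throws Exception {
        var sourceFilename = Path.of("c:\\temp\\source.txt");
        var targetFilename = Path.of("c:\\temp\\target.txt");

        Files.writeString(sourceFilename, "source file");
        Files.writeString(targetFilename, "target file");

        // Open a byte channel for the target file, but release the handle
        // before attempting the move.
        try (SeekableByteChannel sbc = Files.newByteChannel(targetFilename)) {
            sbc.size(); // the handle is only held inside this block
        }

        // With no open handle left on the target, the atomic move is expected to succeed.
        Files.move(sourceFilename, targetFilename, StandardCopyOption.ATOMIC_MOVE);
        System.out.println("moved without AccessDeniedException");
    }
}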

@prakashguru

@kaisecheng Some people have encountered this bug even after the fix, possibly because a buggy implementation of the setting meant the retry never kicked in!

Please refer to #14486.

@prakashguru

Hi, we have updated to Logstash 7.17.3 and are still getting the "Error writing checkpoint: java.nio.file.AccessDeniedException" error. Is there some flag / property that we need to add? We're running it on Windows Server. Thanks

@sharonhac The retry implementation has not kicked in due to another bug. Please refer to #14486.

@sharonhac

Hi, as @reuvenik suggested, using queue.type: memory is a workaround for the issue: the data isn't being saved to disk, so the lock issue doesn't occur.
We haven't tried the retry mechanism with the flag that you suggested.

@kaisecheng
Contributor

@prakashguru many thanks for creating the issue and for continuing to dig into the Windows issue. It is indeed a buggy implementation in SettingImpl 🤦 which has disabled the retry for years. We have seen retry logs in old versions, probably in v6, but it seems to have been disabled starting from v7. Thanks again, I will try to get a fix up and merged next week.

