
PQ AccessDeniedException on Windows in checkpoint move #12345

Open

colinsurprenant opened this issue Oct 14, 2020 · 20 comments · Fixed by #13902

Comments

@colinsurprenant
Contributor

The PQ checkpoint strategy has been modified to write to a temporary file and then atomically move it to the final checkpoint file name.

On Windows, under specific but not yet understood circumstances, the move operation throws an AccessDeniedException. The problem seems to happen when using layered filesystems, such as with NAS and/or when running in containers.

#10234 introduced the queue.checkpoint.retry setting to mitigate the problem by catching that exception and retrying the operation.

We still have two problems with this:

  • Unfortunately the root cause is not yet understood.

  • The mitigation sometimes does not work.

On the latter point, I think we should revisit the queue.checkpoint.retry strategy, which performs a single retry after a 500ms sleep.

Until we figure out the root cause, I think we should improve that strategy by having multiple (configurable?) retries using a backoff policy (exponential?): start with a much lower sleep and increase it until we reach a retry limit.
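
A minimal sketch of what such a retry-with-backoff around the checkpoint move could look like (illustrative only, not the actual Logstash implementation; the class, method, and parameter names here are hypothetical):

import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CheckpointMoveRetry {
    // Retry the atomic checkpoint move with exponential backoff instead of
    // a single retry after a fixed 500ms sleep.
    static void moveWithBackoff(Path tmp, Path target, int maxRetries, long initialSleepMs)
            throws IOException, InterruptedException {
        long sleepMs = initialSleepMs;
        for (int attempt = 0; ; attempt++) {
            try {
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
                return; // success
            } catch (AccessDeniedException e) {
                if (attempt >= maxRetries) {
                    throw e; // give up once the retry limit is reached
                }
                Thread.sleep(sleepMs);
                sleepMs *= 2; // exponential backoff
            }
        }
    }
}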

@jugsofbeer

jugsofbeer commented Oct 28, 2020

I have this exact issue recurring randomly in production while running Logstash v7.9.3 with queue.checkpoint.retry: true.

In my case I've seen a 30-second delay on some temporary files before.

  • usual suspects: anti-virus exclusion policies not 100% staying away from the files?
  • virtual-server-supplied disks might be causing subtle sync issues, randomly 1-3 seconds?
  • a configurable number of retries would help
  • a configurable time delay value would help
  • a configurable (optional) catch-all behavior: if retries fail, terminate the pipeline and restart it with the persisted queue enabled
  • a configurable (optional) catch-all behavior: if retries fail, terminate the pipeline and restart it with the memory queue enabled (even if persisted was specified) ... we want the pipeline running!

@colinsurprenant
Contributor Author

@jugsofbeer Thanks for the feedback.

usual suspects: anti-virus exclusion policies not 100% staying away from the files?

Could be, but AFAIK in the cases I saw reported, a suspected anti-virus was disabled and the problem persisted.

virtual-server-supplied disks might be causing subtle sync issues, randomly 1-3 seconds?

My hypothesis so far is that we are dealing with some Windows filesystem layering (for virtual servers and/or NAS) that causes a race condition, but this is unconfirmed.


@kaisecheng kaisecheng self-assigned this Feb 9, 2022
kaisecheng added a commit that referenced this issue Mar 24, 2022
This commit changes `queue.checkpoint.retry` to `true` by default, allowing retry of checkpoint write failures.
Add exponential backoff retry to checkpoint writes to mitigate AccessDeniedException on Windows.
Fixed: #12345
The same commit (1a5030b) was subsequently cherry-picked to release branches by kaisecheng and the github-actions bot on Mar 24, 2022.
@sharonhac

Hi, we have updated to Logstash 7.17.3 and are still getting the "Error writing checkpoint: java.nio.file.AccessDeniedException" error.
Is there some flag / property that we need to add?
We're running it on Windows Server.
Thanks

@kaisecheng
Contributor

No. Logstash mitigates the issue by retrying with backoff by default. The retry is not infinite. This issue is caused by an anti-virus or even a file explorer locking the file. Retrying indefinitely or a configurable number of retries is not under consideration, as it hides the underlying problem and places too much configuration burden on users.

I am considering a higher number of retries that finishes in around 3 seconds. Another option we could think of is to restart the pipeline instead of shutting it down.

@kaisecheng kaisecheng reopened this May 1, 2022
@sharonhac

Hi, do you have an estimate for a new fix? Will it be included in a new release?
Thanks

@kaisecheng
Contributor

kaisecheng commented May 8, 2022

The plan is to retry for about 3 seconds in the next release. Please note that if any application locks the file for that whole period, the same exception is still thrown. One way to get rid of it is to set anti-virus exclusion rules on the relevant folders.
The checkpoint files live in the data folder, which is $LOGSTASH_HOME/data by default; see https://www.elastic.co/guide/en/logstash/current/dir-layout.html for the path.

Edit: We need more feedback or telemetry to support the decision of a 3-second retry rather than blindly making it longer.
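
For a rough sense of scale (the actual retry parameters Logstash uses are not stated here; the numbers below are an assumption purely for illustration), a doubling backoff starting at 100 ms adds up to roughly 3 seconds after five retries:

public class BackoffBudget {
    public static void main(String[] args) {
        // Hypothetical parameters: 5 retries, doubling from an initial 100 ms sleep.
        long sleepMs = 100;
        long totalMs = 0;
        for (int i = 0; i < 5; i++) {
            totalMs += sleepMs;
            sleepMs *= 2;
        }
        // 100 + 200 + 400 + 800 + 1600 = 3100 ms, i.e. roughly 3 seconds of retrying
        System.out.println("total backoff: " + totalMs + " ms");
    }
}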

@sharonhac

We have disabled the anti-virus and the issue still reproduces.
By "next release", do you mean the next Logstash 7.x?
The issue affects our production environment and reproduces for various pipelines after a fresh reset; the data is not usable because it is not streamed correctly ...
We'd like to get the suggested fix into the latest 7.x version if possible ...
Can you tell whether changing the persistent queue type from disk to memory would make a difference?
Thanks

@kaisecheng
Contributor

kaisecheng commented May 9, 2022

The exception cannot be solved in Logstash alone. One suggestion is to find out which process in Windows locks the file. We don't have a timetable for further investigation. It would be great if you could provide a minimal pipeline and environment information to reproduce the problem for future investigation.

That said, if there is any improvement this year, it will be in 7.x. I will keep this issue up to date.

Using queue.type: memory can avoid this issue as it won't touch the disk, but it gives up the at-least-once delivery guarantee. You may consider queue.drain: true to drain the events in the memory queue before shutdown. Edit: by default, Logstash will try to drain the memory queue before shutdown.

@reuvenik

reuvenik commented May 9, 2022

Hi.
We have the same issue. Running the debug on one pipeline only and sending diagnostics did not give support any clue about the root cause.
We ran the Handle application in a loop to catch locked objects while Logstash was running. Unfortunately, on 3 occasions Handle ran in the same second as the access-denied error appeared, and nothing in the Handle results appeared to lock any file.
So we have no clue how to find out which process is locking these objects. When we look at the OpenedFilesView application, the only thing we can see accessing the data folder in Logstash is java.exe, so again we are at a dead end in finding what program could cause this access-denied issue.

Any suggestions on how to check deeper?

What did you mean by "it gives up the at-least-once delivery guarantee"?
Another question... I looked at the documentation and it seems like the drain option is only for the persistent queue and not for the memory queue... are you sure it can be used with the memory queue to avoid data loss?

Thanks.

@kaisecheng
Contributor

@reuvenik Thanks for pointing out the drain option! For the memory queue, the drain mechanism is enabled by default; Logstash tries to drain the events before shutdown. As events are stored in memory, data loss happens if the host crashes, hence there is no delivery guarantee.

I am far from an expert on the Windows platform and have no new ideas in mind. So far we have confirmed that anti-virus software and File Explorer can lock the file. I am curious whether you can see Handle showing File Explorer when it triggers the exception. If you believe there is no process locking the file, you can share your pipeline, config, and environment setup here to help us reproduce the issue.

@reuvenik

Hi @kaisecheng
If stopping the service will not cause data loss because of the drain option, that is a great relief. So we will lose data only if the server crashes, correct?

Before I ran Handle in a loop, I tested having Explorer open in the data folder, and Handle captured it. When I ran my later test, I again had Explorer open in the data folder and it was captured.
But on all 3 occasions when the access-denied error appeared, Handle ran in the same second and captured nothing.
This, and the fact that OpenedFilesView shows only the java.exe program locking the queue folder, makes me suspect that Logstash is locking itself. We have 3 Logstash servers, but maybe because they listen to the same pipelines (let's say the same Kafka topic), the synchronization between the servers is what is causing the lock. Or maybe this input can lead you to the correct conclusion.

In any case, sending the pipeline configuration is not relevant because it happened in what seems to be a random pipeline each time. And it happened for pipelines with different source technologies (Kafka, Beats, Oracle, Event Hub, JMS...). So I guess the problem is not in the pipeline. It even happened for a pipeline that is fed from another pipeline and sets the index name in the Elasticsearch output.

@kaisecheng
Contributor

@reuvenik Thanks for the input.

So we will lose data only if the server crashes, correct?

This page highlights the key points. In addition to the listed limitations, keep in mind that network connectivity issues or plugin bugs can happen, and an admin might want to restart the process. In such cases, without a persistent queue, the events in memory are lost.

@prakashguru

@kaisecheng we are also encountering this issue, but the strange part is that it happens in only one environment, and even there it is random.

Checking past bugs, I found there were two tests removed (ignored) citing failures on Windows. Surprisingly the exception was "java.lang.IllegalStateException: java.nio.file.AccessDeniedException: C:\Users\jenkins.LOGSTASH-CI-WIN\AppData\Local\Temp\junit1468481081486511579\data\checkpoint.head.tmp -> C:\Users\jenkins.LOGSTASH-CI-WIN\AppData\Local\Temp\junit1468481081486511579\data\checkpoint.head"

#9918

By any chance, could a race condition lead to this type of error in PQ checkpoints?

@prakashguru

I tried to investigate the possible causes of the AccessDeniedException on file move, and found that the following code throws AccessDeniedException.

import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveWithOpenHandle {
    public static void main(String[] args) {
        try {
            var sourceFilename = Path.of("c:\\temp\\source.txt");
            var targetFilename = Path.of("c:\\temp\\target.txt");

            if (Files.exists(sourceFilename))
                Files.delete(sourceFilename);

            if (Files.exists(targetFilename))
                Files.delete(targetFilename);

            // Create the source and target files.
            Files.writeString(sourceFilename, "source file");
            Files.writeString(targetFilename, "target file");

            // Open a byte channel for the target file but don't close it.
            SeekableByteChannel sbc = Files.newByteChannel(targetFilename);

            // Try to move the source file onto the still-open target.
            // On Windows this line throws AccessDeniedException.
            Files.move(sourceFilename, targetFilename, StandardCopyOption.ATOMIC_MOVE);
        } catch (Exception exception) {
            System.out.println(exception.toString());
        }
    }
}

Here I try to move a file onto a target for which a handle is already open (but not closed), simulating a case where a checkpoint head write() happens while a checkpoint read is in progress.

Is this a valid scenario for PQ checkpoint reads and writes?
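
For comparison, here is a variant of the snippet above in which the handle on the target is released (via try-with-resources) before the move. Under the same assumptions this move is expected to succeed, which supports the idea that any process holding an open handle on checkpoint.head at the moment of the move can trigger the exception. The class name is mine; the paths are the same as in the snippet above.

import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveAfterHandleClosed {
    public static void main(String[] args) throws Exception {
        var sourceFilename = Path.of("c:\\temp\\source.txt");
        var targetFilename = Path.of("c:\\temp\\target.txt");

        Files.writeString(sourceFilename, "source file");
        Files.writeString(targetFilename, "target file");

        // Open a byte channel for the target file, but release the handle
        // before attempting the move.
        try (SeekableByteChannel sbc = Files.newByteChannel(targetFilename)) {
            sbc.size(); // the handle is only held inside this block
        }

        // With no open handle left on the target, the atomic move is expected to succeed.
        Files.move(sourceFilename, targetFilename, StandardCopyOption.ATOMIC_MOVE);
        System.out.println("moved without AccessDeniedException");
    }
}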

@prakashguru

@kaisecheng Some people have encountered this bug even after the fix, possibly because a buggy implementation of the setting meant the retry never kicked in!

Please refer to #14486.

@prakashguru

Hi, we have updated to Logstash 7.17.3 and are still getting the "Error writing checkpoint: java.nio.file.AccessDeniedException" error. Is there some flag / property that we need to add? We're running it on Windows Server. Thanks

@sharonhac The retry implementation has not kicked in due to another bug. Please refer to #14486.

@sharonhac

Hi, as @reuvenik suggested, using queue.type: memory is a workaround for the issue: the data isn't being saved to disk, so the lock issue doesn't occur.
We haven't tried the retry mechanism with the flag that you suggested.

@kaisecheng
Contributor

@prakashguru many thanks for creating the issue and for continuing to dig into the Windows issue. It is indeed a buggy implementation in SettingImpl 🤦 which has disabled the retry for years. We have seen retry logs in old versions, probably in v6, but it seems to have been disabled starting from v7. Thanks again, I will try to get a fix up and merged next week.

