
Job arrays #3892

Merged
merged 113 commits into master on May 9, 2024

Conversation

bentsherman
Member

@bentsherman bentsherman commented Apr 21, 2023

Closes #1477 (and possibly #1427)

Summary of changes:

  • Adds array directive to submit tasks as array jobs of a given size

  • TaskArrayCollector collects tasks into arrays and, when an array is ready, submits it as an array job to the underlying executor. The executor must implement the TaskArrayAware interface. Each process has its own array job collector.

    When all input channels to a process have received the "poison pill", the process is "closed" and the array job collector is notified so that it can submit any remaining tasks. All subsequent tasks (e.g. retries) will be submitted as individual tasks.

  • TaskArray is a special type of TaskRun for an array job that holds the list of child task handlers. For an executor that supports array jobs, the task handler can check whether its task is a TaskArray in order to apply array-job-specific behavior.

  • TaskHandler has a few more methods, which the array job collector uses to create the array job script. This script simply defines the list of child work directories, selects a work dir based on the index, and launches the child task using an executor-specific launch command.

  • TaskPollingMonitor has been modified to handle both array jobs and regular tasks. An array job is handled like any other task, but discarded once it has been submitted. The task stats reported by Nextflow are the same with or without array jobs -- array jobs themselves are not included in the task stats.
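The collector behavior described above can be sketched roughly as follows (hypothetical class and method names; the real TaskArrayCollector in the Nextflow code base is more involved):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of the batching logic: tasks are buffered until the
// array size is reached, then flushed as one array job; once the
// collector is closed, remaining tasks are flushed and any later
// tasks (e.g. retries) bypass batching and are submitted individually.
class ArrayCollector<T> {
    private final int arraySize;
    private final Consumer<List<T>> submitArray;   // submit a whole array job
    private final Consumer<T> submitSingle;        // submit an individual task
    private final List<T> buffer = new ArrayList<>();
    private boolean closed = false;

    ArrayCollector(int arraySize, Consumer<List<T>> submitArray, Consumer<T> submitSingle) {
        this.arraySize = arraySize;
        this.submitArray = submitArray;
        this.submitSingle = submitSingle;
    }

    synchronized void collect(T task) {
        if (closed) {                 // retries after close go out directly
            submitSingle.accept(task);
            return;
        }
        buffer.add(task);
        if (buffer.size() >= arraySize)
            flush();
    }

    synchronized void close() {       // all input channels received the poison pill
        closed = true;
        if (!buffer.isEmpty())
            flush();                  // submit the final, possibly partial, array
    }

    private void flush() {
        submitArray.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```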

Here's the pipeline I'm using as the e2e test:

```
params.n_tasks = 50

process foo {
    array 10

    input: val index
    output: path 'output.txt'

    """
    echo "Hello from task ${index}!" > output.txt
    """
}

process bar {
    debug true
    array 10

    input: path 'input.txt'

    """
    cat input.txt
    """
}

workflow {
    Channel.of(1 .. params.n_tasks) | foo | bar
}
```

TODO:

  • documentation
  • unit tests
  • extra
    • AWS Batch: kill array job instead of child jobs
    • Google Batch: kill array job instead of child jobs
    • Grid executors: add array index environment var to container
    • handle retried tasks with dynamic resources
  • manual e2e tests
    • SLURM
    • SLURM + Fusion
    • AWS Batch
    • AWS Batch + Fusion
    • Google Batch
    • Google Batch + Fusion

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Member Author

If a task fails and is retried with increased resources, it will be batched with other tasks that may still be on their first attempt. In that case, the array job resources will depend on whichever task happens to be first in the batch.

One solution is to take the max value of cpus, memory, time, etc for all tasks in an array job. That would be "safe" but likely much more expensive -- if a single task requests twice the resources, suddenly the entire array job does as well.

Another solution is to further separate batches by configuration, to ensure that they are uniform. We could go crazy and separate batches by the tuple of (cpus, memory, time, ...), but I think that would be overkill. Better I think to just split based on attempt and tell users to "handle with care".

@bentsherman
Member Author

bentsherman commented Apr 21, 2023

We could also just provide config options for these things:

  • executor.$array.groupKeys (default: ['process', 'attempt']) controls how batches are separated

  • executor.$array.requestMaxResources controls whether the array executor "plays it safe" by taking the max resources across all tasks in an array
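In nextflow.config these proposals might look like this (hypothetical sketch of options discussed in this thread, not a merged API -- the feature as merged is driven by the `array` process directive):

```
// hypothetical: split batches by process and attempt, don't take max resources
executor.$array.groupKeys = ['process', 'attempt']
executor.$array.requestMaxResources = false
```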

@bentsherman
Member Author

The point of these config options is that there is a trade-off between bandwidth and latency when batching tasks like this, so users should ideally be able to manage that trade-off in the way that best fits their use case. If someone doesn't use retry with dynamic resources, then they don't need to group by attempt, and vice versa.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>

Signed-off-by: Ben Sherman <bentshermann@gmail.com>

Signed-off-by: Ben Sherman <bentshermann@gmail.com>

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
pditommaso and others added 3 commits May 7, 2024 21:33
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@pditommaso
Member

umm, this still shows

Uploading local `bin` scripts folder to az://my-data/work/tmp/ef/5754c0af6a3622dadd5bab803c316c/bin
Monitor the execution with Seqera Platform using this URL: https://cloud.seqera.io/user/pditommaso/watch/3NQ9I51JIGRbX4
Error: Exception in thread "tower-logs-checkpoint" java.lang.NullPointerException: Cannot invoke "java.lang.Thread.isInterrupted()" because "this.thread" is null
	at io.seqera.tower.plugin.LogsCheckpoint.run(LogsCheckpoint.groovy:70)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1017)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1207)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at groovy.lang.Closure.call(Closure.java:433)
	at groovy.lang.Closure.call(Closure.java:412)
	at groovy.lang.Closure.run(Closure.java:505)
	at java.base/java.lang.VirtualThread.run(VirtualThread.java:309)

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Member Author

Yeah I didn't think the volatile would help. Really there is no point in checking if the current thread is interrupted...

@pditommaso
Member

That's the pattern recommended by the Java Concurrency in Practice bible

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso
Member

Reverted 247b721 because the problem comes from the fact that the thread starts before the variable is assigned.

The bottom line is that this code is faulty

@pditommaso
Member

@bentsherman all solved with Google Batch logs and child ids?

@bentsherman
Member Author

Google Batch logs are working

Why would you check if the thread you are currently in is interrupted? If it's interrupted then you wouldn't get the chance to check if it's interrupted, you would just be interrupted...

@pditommaso
Member

@bentsherman
Member Author

Should be solved with: !thread?.isInterrupted()
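A minimal illustration of the race and the null-safe fix (hypothetical Java sketch, not the actual LogsCheckpoint code):

```java
// Hypothetical reproduction of the NPE pattern discussed in this thread:
// `thread` is assigned only after start(), so the worker's first read of
// the field can observe null. Guarding the check (Groovy's
// `!thread?.isInterrupted()`) avoids the NPE.
class Checkpoint {
    private volatile Thread thread;
    volatile boolean checkedSafely = false;

    void launch() {
        Thread t = new Thread(this::run);
        t.start();          // run() may begin before `thread` is assigned below
        thread = t;         // too late for the worker's first check
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void run() {
        Thread t = thread;  // may still be null on the first read
        boolean interrupted = t != null && t.isInterrupted();  // null-safe check
        if (!interrupted)
            checkedSafely = true;
    }
}
```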

@pditommaso
Member

think so

@pditommaso
Member

The dir content when using job arrays is fascinating (only for nerds :))

» tree work/
work/
├── 17
│   └── 5ee762d51d42993b304ec32c2ac69e
├── 38
│   └── 30315d77627dd73e5d2c31b92c8d2b
│       ├── slurm-8_0.out
│       └── slurm-8_1.out
├── 46
│   └── 59224af77ca4b3daf702a4507ffed8
├── bc
│   └── 15aecb161e4bb851a7f7ff27b6b728
├── d0
│   └── 43bb64e2b8b5e78147e93894e495d7
└── e2
    └── 119acedcd1ca2d851bc3abd00ae99a
        ├── slurm-6_0.out
        └── slurm-6_1.out
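The `slurm-8_0.out` / `slurm-8_1.out` files appear to be the per-child stdout of array job 8, array indices 0 and 1 (SLURM's default `%A_%a` naming). A rough sketch of the wrapper idea described in this PR -- list the child work dirs, pick one by the scheduler's array index, launch that child -- using illustrative paths and variable names, not the exact script Nextflow generates:

```shell
#!/bin/bash
# Hypothetical sketch of an array launcher script. The paths and the
# nxf_array_dirs name are illustrative only.
declare -a nxf_array_dirs=(
    work/e2/119acedcd1ca2d851bc3abd00ae99a
    work/38/30315d77627dd73e5d2c31b92c8d2b
)
# SLURM exposes the child index as SLURM_ARRAY_TASK_ID (default 0 here
# so the sketch also runs outside an array job)
task_dir=${nxf_array_dirs[${SLURM_ARRAY_TASK_ID:-0}]}
echo "launching child task in: $task_dir"
# a real wrapper would exec the child launcher, e.g.:
# bash "$task_dir/.command.run"
```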

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso
Member

Reverted some of the renaming of the launch <> submit methods because there were still some inconsistencies, and to avoid breaking the xpack dependencies

@pditommaso
Member

Think we are finally ready to merge this. Great effort 👏 👏

@pditommaso pditommaso merged commit ca9bc9d into master May 9, 2024
22 checks passed
@pditommaso pditommaso deleted the 1477-job-array-executor branch May 9, 2024 10:16
pditommaso added a commit that referenced this pull request May 9, 2024
Job arrays are a capability provided by some batch schedulers that allows spawning multiple copies of the same job in an efficient manner.

Nextflow supports this capability via the process directive `array <some value>`, which determines the (max) number of jobs in the array. For example:

```
process foo  {
  array 10 
  '''
  your_task
  '''
}
```

or in the Nextflow config file:


```
process.array = 10 
```

Currently this feature is supported by the following executors:

* SLURM
* SGE
* PBS
* PBS Pro
* LSF
* AWS Batch
* Google Batch


Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Mahesh Binzer-Panchal <mahesh.binzer-panchal@nbis.se>
Signed-off-by: Herman Singh <herman@massmatrix.bio>
Signed-off-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>
Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Co-authored-by: Abhinav Sharma <abhi18av@users.noreply.github.com>
Co-authored-by: Mahesh Binzer-Panchal <mahesh.binzer-panchal@nbis.se>
Co-authored-by: Herman Singh <kartstig@gmail.com>
Co-authored-by: Dr Marco Claudio De La Pierre <marco.delapierre@gmail.com>