Using parallel with multiple arguments

The beauty of parallel

By 3manuek in Bash Parallel Utilities

April 1, 2024


The example case

One of the projects I’ve been working on lately is about building OCI images with distroless components. The number of generated images is considerably high: as of today, about 2k images have been pushed. Strictly speaking, they are layers rather than standalone images, as there is a dependency chain that leads to a final functional image.

The problem was how to check the image information through the GitHub API. So I built a script that went through all the existing containers (that’s what GitHub calls the images) across several pages.

My initial take was simple: extract the name and the id of each image, parse them, and use variable substitution to scrape and store the image information. The whole process took 12 minutes to run, so something was definitely odd. I used Codeium to refactor the code, but the output didn’t convince me, as it was a more complex version of the old code.

I had used parallel in the past, but this case was different: the arguments were not a combination of values, they were rows of information, one per image version.

The full execution with parallel took around 6 minutes, and keep in mind that I stuck to just 4 jobs at a time, as I was on the edge of the API quota.

Addressing the problem using JSON and parallel

The snippet below is part of a function called index_container_pages, which extracts the information needed to scrape the image versions from the API.

  jq -r  '.[] | {id: .id , name: .name , url: .url} | @json' $(get_container_pages) > ${OUTDIR}/paramix.json

This generated file is the argument list that we will use from here on to parametrize parallel.
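To make the shape of that file concrete, here is a minimal sketch with two made-up records. The ids, names and URLs are hypothetical placeholders, not real packages. It assumes jq is installed, turns each record into one tab-separated row, and then walks the rows sequentially so you can see that every line is one unit of work with three columns.

```bash
#!/usr/bin/env bash
# Hypothetical sample: two records in the one-object-per-line shape of paramix.json.
workdir=$(mktemp -d)
cat > "${workdir}/paramix.json" <<'EOF'
{"id":101,"name":"distroless/base","url":"https://api.example.invalid/container/base"}
{"id":102,"name":"distroless/libc","url":"https://api.example.invalid/container/libc"}
EOF

# One tab-separated row per image: url, name, id.
jq -r '"\(.url)\t\(.name)\t\(.id)"' "${workdir}/paramix.json" > "${workdir}/rows.tsv"

# Sequential view of the same rows: each line is one unit of work.
while IFS=$'\t' read -r url name id; do
  printf 'url=%s name=%s id=%s\n' "$url" "$name" "$id"
done < "${workdir}/rows.tsv"
```

The three fields read by the loop are the same three columns that parallel receives per line once it replaces the loop.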

The code inside the script ended up looking like this:

  HEADER="curl -s -L -H \"Accept: application/vnd.github+json\" \
        -H \"Authorization: Bearer ${GH_TOKEN}\" \
        -H \"X-GitHub-Api-Version: 2022-11-28\""
   
  parallel --progress --colsep '\t' -j4 \
    '[ ! -d '''${OUTDIR}'''/{2} ] && { mkdir -p '''${OUTDIR}'''/{2} ; } ; \
    '''${HEADER}''' \
        {1} > '''${OUTDIR}'''/{2}/{3}.json ; sleep 0.1' \
     :::: <(jq -r '. | "\(.url)\t\(.name)\t\(.id)"' ${OUTDIR}/paramix.json) 

  parallel --progress --colsep '\t' -j4 \
    ''''${HEADER}''' \
        {1}/versions > '''${OUTDIR}'''/{2}/versions.json ; sleep 0.1' \
     :::: <(jq -r '. | "\(.url)\t\(.name)"' ${OUTDIR}/paramix.json) 

  • HEADER is just a string holding the curl command with its headers, used as a macro.
  • The variables wrapped in triple single quotes are the “constants”: the quotes close the single-quoted string, let the calling shell expand the variable, and open the string again, so the values are injected directly into the parallel command.
  • The :::: is the parallel equivalent of a while read loop over a file. Here the file is the output of jq, which turns paramix.json into TSV, one line per job. You can use CSV too; in either case the separator is set with --colsep (here '\t').
  • The {1}, {2} and {3} are the columns of each line, in the order that the jq command prints them.
  • The sleep isn’t strictly necessary, but it spaces out the requests so that the iteration is less likely to be considered suspicious.
  • The name field returns the name path of the image, so we use that exact name to create the local path.
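The triple single quote trick is easier to follow in isolation. The sketch below uses a sample OUTDIR value (an assumption, any path without spaces behaves the same) and builds the same command string twice: once in the style used in the script, and once with plain double quotes, which should be equivalent for values like this one.

```bash
#!/usr/bin/env bash
# Sample value, only for illustration.
OUTDIR="/tmp/images"

# Style used in the script: close the quote, expand the variable, reopen the quote.
cmd_triple='mkdir -p '''${OUTDIR}'''/{2}'

# Double quotes expand ${OUTDIR} right away as well,
# while {2} is left untouched for parallel to replace later.
cmd_double="mkdir -p ${OUTDIR}/{2}"

printf '%s\n' "$cmd_triple" "$cmd_double"
```

Both lines print the same string, with the directory already filled in and the {2} placeholder still in place. Note that in the triple quote form the variable is expanded outside any quotes, so a value containing spaces would be split into several words when passed as a command argument.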

You can control the parallelization even better by computing the value of the -j flag from the share of processing units that you want to assign. E.g.:

-j $(expr $(nproc) / 2 + 1)
-j $(nproc)
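The same calculation can be done with shell arithmetic instead of expr. This is a small sketch, assuming nproc from coreutils is available; the variable names are made up for the example.

```bash
#!/usr/bin/env bash
# Number of processing units available to this process (coreutils).
cores=$(nproc)

# Half of the cores plus one, using shell arithmetic instead of expr.
half_jobs=$(( cores / 2 + 1 ))

echo "cores=${cores} half_jobs=${half_jobs}"
# The value would then be passed as: parallel -j "${half_jobs}" ...
```

GNU parallel also accepts a percentage for this flag, such as -j 50%, which may be enough when you do not need the extra job.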

Keep in mind that any API limits the number of permitted requests. That said, the example above can also be used to process things locally, where no such quota applies.

Other combinations

The parallel arguments are basically controlled by :::, :::+ and :::: (the one we used above for taking arguments from a file). ::: combines each of its values with every value of the other input sources, whereas :::+ links its values one-to-one with those of the previous input source.

parallel echo ::: 1 2 3 ::: A B C ::: T
1 A T
1 B T
1 C T
2 A T
2 B T
2 C T
3 A T
3 B T
3 C T
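One detail worth knowing: parallel prints results as jobs finish, so the lines may come out in a different order than in the listing above. A minimal sketch, assuming GNU parallel is the parallel found in your PATH, that uses -k (--keep-order) to get the output in input order:

```bash
#!/usr/bin/env bash
# -k (--keep-order) prints results in the order of the input,
# regardless of which job finishes first.
out=$(parallel -k echo ::: 1 2 3 ::: A B C ::: T)
printf '%s\n' "$out"
```

For jobs as short as echo the order rarely changes, but with network calls like the ones in the script it is a common surprise.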

Now, suppose that I want A B C paired with 1 2 3 instead of every combination. I can use :::+ so that each value is used in a single iteration:

parallel echo ::: 1 2 3 :::+ A B C ::: T
1 A T
2 B T
3 C T

parallel echo ::: 1 2 3 ::: A B C :::+ T
1 A T
2 A T
3 A T

A source given with :::+ constrains the source to its left: in the last example, T is linked to A B C, and because linked sources stop at the shortest one, only A is used. Keep this in mind when building argument lists.
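As a mental model, ::: behaves like nested loops and :::+ like walking two lists by index. The plain bash sketch below only illustrates what the two operators compute for lists of equal length. It is not how parallel is implemented, and the function names are made up for the example.

```bash
#!/usr/bin/env bash
nums=(1 2 3)
letters=(A B C)

# Like `::: 1 2 3 ::: A B C`: every number with every letter.
cartesian() {
  local n l
  for n in "${nums[@]}"; do
    for l in "${letters[@]}"; do
      echo "$n $l"
    done
  done
}

# Like `::: 1 2 3 :::+ A B C`: first with first, second with second, and so on.
linked() {
  local i
  for i in "${!nums[@]}"; do
    echo "${nums[i]} ${letters[i]}"
  done
}

cartesian
echo "--"
linked
```

The first function prints nine pairs and the second one three, which mirrors the difference between the two parallel invocations above.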

Thanks for reading!
