I often work with Apache Spark in an air-gapped environment, and I need to be able to quickly add a Maven artifact as a dependency for the Spark job I’m working on. Because the installation is air-gapped, I can’t use --packages, which would resolve the packages from the internet.

So I created a small Python script that uses Maven to do two things:

  • it downloads the jars for a given artifact and its dependencies to the current working directory

  • it creates an assembly jar (a single fat jar that contains the artifact and all its dependencies)

The script can be called like this:

python3 download_artifact.py org.apache.hadoop:hadoop-aws:3.2.0

This downloads a lot of jars to the current working directory and also creates target/my-app-1.0-SNAPSHOT-jar-with-dependencies.jar, which is the fat jar.
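
Once the fat jar exists, it can be handed to Spark as a local file instead of relying on --packages. The following is a minimal sketch of how a spark-submit command line could be assembled for that; the helper name build_spark_submit and the job script my_job.py are hypothetical placeholders, and only the --jars option (a comma-separated list of jar paths) is taken from spark-submit itself.

```python
import shlex

# Path produced by the script; the jar name comes from the generated pom.xml.
FAT_JAR = "target/my-app-1.0-SNAPSHOT-jar-with-dependencies.jar"


def build_spark_submit(app, jars, extra_args=()):
    """Return a spark-submit argument list that ships local jars with the job.

    `app` (the job script) and `extra_args` are hypothetical placeholders.
    """
    # --jars takes a comma-separated list of local jar paths.
    return ["spark-submit", "--jars", ",".join(jars), *extra_args, app]


if __name__ == "__main__":
    cmd = build_spark_submit("my_job.py", [FAT_JAR])
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

The same helper also accepts the individual jars from the working directory if you would rather not ship one large jar.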

The source code (download_artifact.py) is as follows:

#!/usr/bin/python3

def write_pom(artifacts):
  preamble = """
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <groupId>com.mycompany.app</groupId>
  <artifactId>my-app</artifactId>
  <version>1.0-SNAPSHOT</version>
  <build>
      <plugins>
      <plugin>
        <!-- NOTE: We don't need a groupId specification because the group is
             org.apache.maven.plugins ...which is assumed by default.
         -->
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.3.0</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id> <!-- this is used for inheritance merges -->
            <phase>package</phase> <!-- bind to the packaging phase -->
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      </plugins>
  </build>
  
  <dependencies>
  """
  postamble = """
  </dependencies>
  </project>
  """
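  deps = ""
  for a in artifacts:
      # Each artifact is expected as groupId:artifactId:version
      descriptor = a.split(":")
      if len(descriptor) < 3:
          raise ValueError("expected groupId:artifactId:version, got " + a)
      group_id, artifact_id, version = descriptor[0], descriptor[1], ":".join(descriptor[2:])
      deps += f"<dependency><groupId>{group_id}</groupId><artifactId>{artifact_id}</artifactId><version>{version}</version></dependency>"
  import os
  # Refuse to overwrite a pom.xml that belongs to another project
  if os.path.exists("pom.xml"):
      raise Exception("pom.xml already exists")
  with open("pom.xml", "w") as f:
      f.write(preamble + deps + postamble)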

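def run_maven():
    import os
    # Copy the artifact and its transitive dependencies into the current directory.
    if os.system("mvn dependency:copy-dependencies -DoutputDirectory=./") != 0:
        print("mvn dependency:copy-dependencies failed, skipping the fat jar")
        return
    # The assembly plugin is bound to the package phase, so this builds the fat jar under target/.
    os.system("mvn package -DoutputDirectory=./")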

def remove_pom():
    import os
    os.unlink("pom.xml")

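import sys
if len(sys.argv) != 2:
    sys.exit("usage: python3 download_artifact.py groupId:artifactId:version")
write_pom([sys.argv[1]])
try:
    run_maven()
finally:
    # Always clean up the temporary pom.xml, even if Maven fails
    remove_pom()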
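
The script builds the dependency entry by string concatenation, which is fine for ordinary coordinates. As an alternative, here is a minimal sketch that builds the same element with the standard library's xml.etree.ElementTree, so that any special characters are escaped. The function name dependency_xml is hypothetical, and unlike the script it assumes the coordinate has exactly three parts.

```python
import xml.etree.ElementTree as ET


def dependency_xml(coordinate):
    """Render one groupId:artifactId:version coordinate as a <dependency> element."""
    parts = coordinate.split(":")
    # Assumption: exactly three parts; the original script is more lenient here.
    if len(parts) != 3:
        raise ValueError(f"expected groupId:artifactId:version, got {coordinate!r}")
    dep = ET.Element("dependency")
    for tag, text in zip(("groupId", "artifactId", "version"), parts):
        ET.SubElement(dep, tag).text = text
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(dep, encoding="unicode")


print(dependency_xml("org.apache.hadoop:hadoop-aws:3.2.0"))
```

The returned string could be dropped into the pom between the preamble and the postamble in place of the concatenated one.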