
Montana Mendy

Getting started with an open source NSA tool to construct distributed graphs


Datawave is a Java-based ingest and query framework that leverages Apache Accumulo to provide fast, secure access to your data. Datawave supports a wide variety of use cases, including but not limited to:

  • Data fusion across structured and unstructured datasets
  • Construction and analysis of distributed graphs
  • Multi-tenant data architectures, with tenants having distinct security requirements and data access patterns
  • Fine-grained control over data access, integrated easily with existing user-authorization services and PKI

Below is the basic structure of Datawave:

[Diagram: basic structure of DataWave]

Here's a graphic I made that gets a little more in-depth:

[Flowchart: DataWave in more depth]

Now that you know a bit better how the architecture flows, let's start with installing Datawave; then we'll get into Edges.

Getting started with NSA's Datawave

Before we start, NB: you should have an understanding of simple Bash scripting, Linux commands like grep and awk, and piping.
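If you want a quick self-test, the kind of piping the quickstart itself relies on looks like this (the log file name and process name here are just illustrative):

tail -n 7 build-progress.tmp | grep "BUILD SUCCESS"          # check a build log for success
ps aux | grep -i accumulo | grep -v grep | awk '{print $2}'  # print PIDs of Accumulo processes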

What you'll need

Linux, Bash, and an internet connection to wget tarballs (tar.gz). You should also be able to ssh to localhost without a passphrase.

Note that the quickstart Hadoop install will set up passphrase-less ssh for you automatically, unless it detects that you already have a private/public key pair generated.
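If you do need to wire it up yourself, a typical passphrase-less ssh setup for localhost looks something like this (standard OpenSSH paths assumed):

ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa    # generate a key pair with no passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys     # authorize the key for localhost logins
chmod 600 ~/.ssh/authorized_keys
ssh localhost 'echo ok'                             # should print "ok" with no passphrase prompt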

Familiarize yourself with swap and/or swapping if you haven't already. You'll need this. https://wiki.gentoo.org/wiki/Swap
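To check what swap you have, and to add a swap file if you come up empty, something like the following works on most distros; the 4G size is only an example:

swapon --show                      # list active swap devices/files
free -h                            # show memory and swap usage
# Example only: create and enable a 4G swap file
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile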

Installing Datawave through the CLI in 5 commands

echo "source DW_SOURCE/contrib/datawave-quickstart/bin/env.sh" >> ~/.bashrc source ~/.bashrc allInstall datawaveWebStart && datawaveWebTest 

So we're adding the quickstart's env.sh source line to our .bashrc; the same idea applies if you're using zsh.
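For zsh users, the equivalent (assuming the quickstart's env.sh behaves the same under zsh, which I haven't verified) would be:

echo "source DW_SOURCE/contrib/datawave-quickstart/bin/env.sh" >> ~/.zshrc
source ~/.zshrc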

The five commands above will complete the entire quickstart installation. However, it’s a good idea to at least skim over the sections below to get an idea of how the setup works and how to customize it for your own preferences.

To keep things simple, DataWave, Hadoop, Accumulo, ZooKeeper, and Wildfly will be installed under your DW_SOURCE/contrib/datawave-quickstart directory, and all will be owned by and executed as the current user, which is why a bash script was being run in the background.
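Once allInstall and datawaveWebStart finish, it's worth a quick sanity check from the same shell. This is just a sketch: datawaveStatus and datawavePidList come from the quickstart's bootstrap functions (shown later in this post), and jps is the standard JDK process lister:

datawaveStatus     # report the status of the DataWave ingest and web components
datawavePidList    # print the PIDs of any running DataWave processes
jps -lm            # the Hadoop, Accumulo, and ZooKeeper JVMs should also show up here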

Overriding your default binaries

On some occasions you may need to override the default binaries (not on all machines or setups). Let's open up Vim and do this in case you do end up needing to override your binaries. To override the quickstart's default version of a particular binary, simply override the desired DW_*_DIST_URI value as shown below. URIs may be local or remote; local file URI values must be prefixed with file://, so let's start:

vi ~/.bashrc

export DW_HADOOP_DIST_URI=file:///my/local/binaries/hadoop-x.y.z.tar.gz
export DW_ACCUMULO_DIST_URI=http://some.apache.mirror/accumulo/1.x/accumulo-1.x-bin.tar.gz
export DW_ZOOKEEPER_DIST_URI=http://some.apache.mirror/zookeeper/x.y/zookeeper-x.y.z.tar.gz
export DW_WILDFLY_DIST_URI=file:///my/local/binaries/wildfly-10.x.tar.gz
export DW_JAVA_DIST_URI=file:///my/local/binaries/jdk-8-update-x.tar.gz
export DW_MAVEN_DIST_URI=file:///my/local/binaries/apache-maven-x.y.z.tar.gz

We just pointed at Apache Hadoop, Accumulo, ZooKeeper, Wildfly, Java, and Maven binaries. If this seems like a lot, we can always bootstrap the environment with a bash script instead. This doesn't give you as much flexibility (unless you want to add it), but it is quicker. Add your own shebang line at the top, and make the script executable when you're done copy/pasting by running chmod u+x bootstrap_datawave.sh:

DW_DATAWAVE_SERVICE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
DW_DATAWAVE_SOURCE_DIR="$( cd "${DW_DATAWAVE_SERVICE_DIR}/../../../../.." && pwd )"
DW_DATAWAVE_ACCUMULO_AUTHS="${DW_DATAWAVE_ACCUMULO_AUTHS:-PUBLIC,PRIVATE,FOO,BAR,DEF,A,B,C,D,E,F,G,H,I,DW_USER,DW_SERV,DW_ADMIN,JBOSS_ADMIN}"

# Import DataWave Web test user configuration
source "${DW_DATAWAVE_SERVICE_DIR}/bootstrap-user.sh"

# Selected Maven profile for the DataWave build
DW_DATAWAVE_BUILD_PROFILE=${DW_DATAWAVE_BUILD_PROFILE:-dev}

# Maven command
DW_DATAWAVE_BUILD_COMMAND="${DW_DATAWAVE_BUILD_COMMAND:-mvn -P${DW_DATAWAVE_BUILD_PROFILE} -Ddeploy -Dtar -Ddist -Dservices -DskipTests clean install --builder smart -T1.0C}"

# Home of any temp data and *.properties file overrides for this instance of DataWave
DW_DATAWAVE_DATA_DIR="${DW_CLOUD_DATA}/datawave"

# Temp dir for persisting our dynamically-generated ${DW_DATAWAVE_BUILD_PROFILE}.properties file
DW_DATAWAVE_BUILD_PROPERTIES_DIR="${DW_DATAWAVE_DATA_DIR}/build-properties"

DW_DATAWAVE_BUILD_STATUS_LOG="${DW_DATAWAVE_BUILD_PROPERTIES_DIR}/build-progress.tmp"

DW_DATAWAVE_INGEST_TARBALL="*/datawave-${DW_DATAWAVE_BUILD_PROFILE}-*-dist.tar.gz"
DW_DATAWAVE_WEB_TARBALL="*/datawave-ws-deploy-application-*-${DW_DATAWAVE_BUILD_PROFILE}.tar.gz"

DW_DATAWAVE_KEYSTORE="${DW_DATAWAVE_KEYSTORE:-${DW_DATAWAVE_SOURCE_DIR}/web-services/deploy/application/src/main/wildfly/overlay/standalone/configuration/certificates/testServer.p12}"
DW_DATAWAVE_KEYSTORE_PASSWORD=${DW_DATAWAVE_KEYSTORE_PASSWORD:-ChangeIt}
DW_DATAWAVE_KEYSTORE_TYPE="${DW_DATAWAVE_KEYSTORE_TYPE:-PKCS12}"
DW_DATAWAVE_TRUSTSTORE="${DW_DATAWAVE_TRUSTSTORE:-${DW_DATAWAVE_SOURCE_DIR}/web-services/deploy/application/src/main/wildfly/overlay/standalone/configuration/certificates/ca.jks}"
DW_DATAWAVE_TRUSTSTORE_PASSWORD=${DW_DATAWAVE_TRUSTSTORE_PASSWORD:-ChangeIt}
DW_DATAWAVE_TRUSTSTORE_TYPE="${DW_DATAWAVE_TRUSTSTORE_TYPE:-JKS}"

# Accumulo shell script for initializing whatever we may need in Accumulo for DataWave
function createAccumuloShellInitScript() {
  # Allow user to inject their own script into the env...
  [ -n "${DW_ACCUMULO_SHELL_INIT_SCRIPT}" ] && return 0

  # Create script and add 'datawave' VFS context, if enabled...
  DW_ACCUMULO_SHELL_INIT_SCRIPT="
createnamespace datawave
createtable datawave.queryMetrics_m
createtable datawave.queryMetrics_s
setauths -s ${DW_DATAWAVE_ACCUMULO_AUTHS}"

  if [ "${DW_ACCUMULO_VFS_DATAWAVE_ENABLED}" != false ] ; then
    DW_ACCUMULO_SHELL_INIT_SCRIPT="${DW_ACCUMULO_SHELL_INIT_SCRIPT}
config -s table.classpath.context=datawave"
  fi

  DW_ACCUMULO_SHELL_INIT_SCRIPT="${DW_ACCUMULO_SHELL_INIT_SCRIPT}
quit
"
}

function createBuildPropertiesDirectory() {
  if [ ! -d ${DW_DATAWAVE_BUILD_PROPERTIES_DIR} ] ; then
    if ! mkdir -p ${DW_DATAWAVE_BUILD_PROPERTIES_DIR} ; then
      error "Failed to create directory ${DW_DATAWAVE_BUILD_PROPERTIES_DIR}"
      return 1
    fi
  fi
  return 0
}

function setBuildPropertyOverrides() {
  # DataWave's build configs (*.properties) can be loaded from a variety of locations based on the 'read-properties'
  # Maven plugin configuration. Typically, the source-root/properties/*.properties files are loaded first to provide
  # default values, starting with 'default.properties', followed by '{selected-profile}.properties'. Finally,
  # ~/.m2/datawave/properties/{selected-profile}.properties is loaded, if it exists, allowing you to override
  # defaults as needed

  # With that in mind, the goal of this function is to generate a new '${DW_DATAWAVE_BUILD_PROFILE}.properties' file under
  # DW_DATAWAVE_BUILD_PROPERTIES_DIR and *symlinked* as ~/.m2/datawave/properties/${DW_DATAWAVE_BUILD_PROFILE}.properties,
  # to inject all the overrides that we need for successful deployment to source-root/contrib/datawave-quickstart/

  # If a file having the name '${DW_DATAWAVE_BUILD_PROFILE}.properties' already exists under ~/.m2/datawave/properties,
  # then it will be renamed automatically with a ".saved-by-quickstart-$(date)" suffix, and the symlink for the new
  # file will be created as required

  local BUILD_PROPERTIES_BASENAME=${DW_DATAWAVE_BUILD_PROFILE}.properties
  local BUILD_PROPERTIES_FILE=${DW_DATAWAVE_BUILD_PROPERTIES_DIR}/${BUILD_PROPERTIES_BASENAME}
  local BUILD_PROPERTIES_SYMLINK_DIR=${HOME}/.m2/datawave/properties
  local BUILD_PROPERTIES_SYMLINK=${BUILD_PROPERTIES_SYMLINK_DIR}/${BUILD_PROPERTIES_BASENAME}

  ! createBuildPropertiesDirectory && error "Failed to override properties!" && return 1

  # Create symlink directory if it doesn't exist
  [ ! -d ${BUILD_PROPERTIES_SYMLINK_DIR} ] \
    && ! mkdir -p ${BUILD_PROPERTIES_SYMLINK_DIR} \
    && error "Failed to create symlink directory ${BUILD_PROPERTIES_SYMLINK_DIR}" \
    && return 1

  # Copy existing source-root/properties/${DW_DATAWAVE_BUILD_PROFILE}.properties to our new $BUILD_PROPERTIES_FILE
  ! cp "${DW_DATAWAVE_SOURCE_DIR}/properties/${DW_DATAWAVE_BUILD_PROFILE}.properties" ${BUILD_PROPERTIES_FILE} \
    && error "Aborting property overrides! Failed to copy ${DW_DATAWAVE_BUILD_PROFILE}.properties" \
    && return 1

  # Apply overrides as needed by simply appending them to the end of the file...
  echo "#" >> ${BUILD_PROPERTIES_FILE}
  echo "######## Begin overrides for datawave-quickstart ########" >> ${BUILD_PROPERTIES_FILE}
  echo "#" >> ${BUILD_PROPERTIES_FILE}
  echo "WAREHOUSE_ACCUMULO_HOME=${ACCUMULO_HOME}" >> ${BUILD_PROPERTIES_FILE}
  echo "WAREHOUSE_INSTANCE_NAME=${DW_ACCUMULO_INSTANCE_NAME}" >> ${BUILD_PROPERTIES_FILE}
  echo "WAREHOUSE_JOBTRACKER_NODE=${DW_HADOOP_RESOURCE_MANAGER_ADDRESS}" >> ${BUILD_PROPERTIES_FILE}
  echo "INGEST_ACCUMULO_HOME=${ACCUMULO_HOME}" >> ${BUILD_PROPERTIES_FILE}
  echo "INGEST_INSTANCE_NAME=${DW_ACCUMULO_INSTANCE_NAME}" >> ${BUILD_PROPERTIES_FILE}
  echo "INGEST_JOBTRACKER_NODE=${DW_HADOOP_RESOURCE_MANAGER_ADDRESS}" >> ${BUILD_PROPERTIES_FILE}
  echo "BULK_INGEST_DATA_TYPES=${DW_DATAWAVE_INGEST_BULK_DATA_TYPES}" >> ${BUILD_PROPERTIES_FILE}
  echo "LIVE_INGEST_DATA_TYPES=${DW_DATAWAVE_INGEST_LIVE_DATA_TYPES}" >> ${BUILD_PROPERTIES_FILE}
  echo "PASSWORD=${DW_ACCUMULO_PASSWORD}" >> ${BUILD_PROPERTIES_FILE}
  echo "ZOOKEEPER_HOME=${ZOOKEEPER_HOME}" >> ${BUILD_PROPERTIES_FILE}
  echo "HADOOP_HOME=${HADOOP_HOME}" >> ${BUILD_PROPERTIES_FILE}
  echo "MAPRED_HOME=${HADOOP_HOME}" >> ${BUILD_PROPERTIES_FILE}
  echo "WAREHOUSE_HADOOP_CONF=${HADOOP_CONF_DIR}" >> ${BUILD_PROPERTIES_FILE}
  echo "INGEST_HADOOP_CONF=${HADOOP_CONF_DIR}" >> ${BUILD_PROPERTIES_FILE}
  echo "HDFS_BASE_DIR=${DW_DATAWAVE_INGEST_HDFS_BASEDIR}" >> ${BUILD_PROPERTIES_FILE}
  echo "MAPRED_INGEST_OPTS=${DW_DATAWAVE_MAPRED_INGEST_OPTS}" >> ${BUILD_PROPERTIES_FILE}
  echo "LOG_DIR=${DW_DATAWAVE_INGEST_LOG_DIR}" >> ${BUILD_PROPERTIES_FILE}
  echo "FLAG_DIR=${DW_DATAWAVE_INGEST_FLAGFILE_DIR}" >> ${BUILD_PROPERTIES_FILE}
  echo "FLAG_MAKER_CONFIG=${DW_DATAWAVE_INGEST_FLAGMAKER_CONFIGS}" >> ${BUILD_PROPERTIES_FILE}
  echo "BIN_DIR_FOR_FLAGS=${DW_DATAWAVE_INGEST_HOME}/bin" >> ${BUILD_PROPERTIES_FILE}
  echo "KEYSTORE=${DW_DATAWAVE_KEYSTORE}" >> ${BUILD_PROPERTIES_FILE}
  echo "KEYSTORE_TYPE=${DW_DATAWAVE_KEYSTORE_TYPE}" >> ${BUILD_PROPERTIES_FILE}
  echo "KEYSTORE_PASSWORD=${DW_DATAWAVE_KEYSTORE_PASSWORD}" >> ${BUILD_PROPERTIES_FILE}
  echo "TRUSTSTORE=${DW_DATAWAVE_TRUSTSTORE}" >> ${BUILD_PROPERTIES_FILE}
  echo "TRUSTSTORE_PASSWORD=${DW_DATAWAVE_TRUSTSTORE_PASSWORD}" >> ${BUILD_PROPERTIES_FILE}
  echo "TRUSTSTORE_TYPE=${DW_DATAWAVE_TRUSTSTORE_TYPE}" >> ${BUILD_PROPERTIES_FILE}
  echo "FLAG_METRICS_DIR=${DW_DATAWAVE_INGEST_FLAGMETRICS_DIR}" >> ${BUILD_PROPERTIES_FILE}
  echo "accumulo.instance.name=${DW_ACCUMULO_INSTANCE_NAME}" >> ${BUILD_PROPERTIES_FILE}
  echo "accumulo.user.password=${DW_ACCUMULO_PASSWORD}" >> ${BUILD_PROPERTIES_FILE}
  echo "cached.results.hdfs.uri=${DW_HADOOP_DFS_URI}" >> ${BUILD_PROPERTIES_FILE}
  echo "type.metadata.hdfs.uri=${DW_HADOOP_DFS_URI}" >> ${BUILD_PROPERTIES_FILE}
  echo "mapReduce.hdfs.uri=${DW_HADOOP_DFS_URI}" >> ${BUILD_PROPERTIES_FILE}
  echo "bulkResults.hdfs.uri=${DW_HADOOP_DFS_URI}" >> ${BUILD_PROPERTIES_FILE}
  echo "jboss.log.hdfs.uri=${DW_HADOOP_DFS_URI}" >> ${BUILD_PROPERTIES_FILE}
  echo "lock.file.dir=${DW_DATAWAVE_INGEST_LOCKFILE_DIR}" >> ${BUILD_PROPERTIES_FILE}
  echo "server.keystore.password=${DW_ACCUMULO_PASSWORD}" >> ${BUILD_PROPERTIES_FILE}
  echo "mysql.user.password=${DW_ACCUMULO_PASSWORD}" >> ${BUILD_PROPERTIES_FILE}
  echo "jboss.jmx.password=${DW_ACCUMULO_PASSWORD}" >> ${BUILD_PROPERTIES_FILE}
  echo "jboss.managed.executor.service.default.max.threads=${DW_WILDFLY_EE_DEFAULT_MAX_THREADS:-48}" >> ${BUILD_PROPERTIES_FILE}
  echo "hornetq.cluster.password=${DW_ACCUMULO_PASSWORD}" >> ${BUILD_PROPERTIES_FILE}
  echo "hornetq.system.password=${DW_ACCUMULO_PASSWORD}" >> ${BUILD_PROPERTIES_FILE}
  echo "mapReduce.job.tracker=${DW_HADOOP_RESOURCE_MANAGER_ADDRESS}" >> ${BUILD_PROPERTIES_FILE}
  echo "bulkResults.job.tracker=${DW_HADOOP_RESOURCE_MANAGER_ADDRESS}" >> ${BUILD_PROPERTIES_FILE}
  echo "EVENT_DISCARD_INTERVAL=0" >> ${BUILD_PROPERTIES_FILE}
  echo "ingest.data.types=${DW_DATAWAVE_INGEST_LIVE_DATA_TYPES},${DW_DATAWAVE_INGEST_BULK_DATA_TYPES}" >> ${BUILD_PROPERTIES_FILE}
  echo "JOB_CACHE_REPLICATION=1" >> ${BUILD_PROPERTIES_FILE}
  echo "EDGE_DEFINITION_FILE=${DW_DATAWAVE_INGEST_EDGE_DEFINITIONS}" >> ${BUILD_PROPERTIES_FILE}
  echo "DATAWAVE_INGEST_HOME=${DW_DATAWAVE_INGEST_HOME}" >> ${BUILD_PROPERTIES_FILE}
  echo "PASSWORD_INGEST_ENV=${DW_DATAWAVE_INGEST_PASSWD_FILE}" >> ${BUILD_PROPERTIES_FILE}
  echo "hdfs.site.config.urls=file://${HADOOP_CONF_DIR}/core-site.xml,file://${HADOOP_CONF_DIR}/hdfs-site.xml" >> ${BUILD_PROPERTIES_FILE}
  echo "table.shard.numShardsPerDay=${DW_DATAWAVE_INGEST_NUM_SHARDS}" >> ${BUILD_PROPERTIES_FILE}

  generateTestDatawaveUserServiceConfig

  # Apply DW_JAVA_HOME_OVERRIDE, if needed...
  # We can override the JAVA_HOME location for the DataWave deployment, if necessary. E.g., if we're deploying
  # to a Docker container or other, where our current JAVA_HOME isn't applicable
  if [ -n "${DW_JAVA_HOME_OVERRIDE}" ] ; then
    echo "JAVA_HOME=${DW_JAVA_HOME_OVERRIDE}" >> ${BUILD_PROPERTIES_FILE}
  else
    echo "JAVA_HOME=${JAVA_HOME}" >> ${BUILD_PROPERTIES_FILE}
  fi

  # Apply DW_ROOT_DIRECTORY_OVERRIDE, if needed...
  # We can override any instances of DW_DATAWAVE_SOURCE_DIR within the build config in order to relocate
  # the deployment, if necessary. E.g., used when building the datawave-quickstart Docker image to reorient
  # the deployment under /opt/datawave/ within the container
  if [ -n "${DW_ROOT_DIRECTORY_OVERRIDE}" ] ; then
    sed -i "s~${DW_DATAWAVE_SOURCE_DIR}~${DW_ROOT_DIRECTORY_OVERRIDE}~g" ${BUILD_PROPERTIES_FILE}
  fi

  # Create the symlink under ~/.m2/datawave/properties
  setBuildPropertiesSymlink || return 1
}

function setBuildPropertiesSymlink() {
  # Replace any existing ~/.m2/datawave/properties/${BUILD_PROPERTIES_BASENAME} file/symlink with
  # a symlink to our new ${BUILD_PROPERTIES_FILE}
  if [[ -f ${BUILD_PROPERTIES_SYMLINK} || -L ${BUILD_PROPERTIES_SYMLINK} ]] ; then
    if [ -L ${BUILD_PROPERTIES_SYMLINK} ] ; then
      info "Unlinking existing symbolic link: ${BUILD_PROPERTIES_SYMLINK}"
      if ! unlink "${BUILD_PROPERTIES_SYMLINK}" ; then
        warn "Failed to unlink $( readlink ${BUILD_PROPERTIES_SYMLINK} ) from ${BUILD_PROPERTIES_SYMLINK_DIR}"
      fi
    else
      local backupFile="${BUILD_PROPERTIES_SYMLINK}.saved-by-quickstart.$(date +%Y-%m-%d-%H%M%S)"
      info "Backing up your existing ~/.m2/**/${BUILD_PROPERTIES_BASENAME} file to ~/.m2/**/$( basename ${backupFile} )"
      if ! mv "${BUILD_PROPERTIES_SYMLINK}" "${backupFile}" ; then
        error "Failed to backup ${BUILD_PROPERTIES_SYMLINK}. Aborting properties file override. Please fix me!!"
        return 1
      fi
    fi
  fi

  if ln -s "${BUILD_PROPERTIES_FILE}" "${BUILD_PROPERTIES_SYMLINK}" ; then
    info "Override for ${BUILD_PROPERTIES_BASENAME} successful"
  else
    error "Override for ${BUILD_PROPERTIES_BASENAME} failed"
    return 1
  fi
}

function datawaveBuildSucceeded() {
  local success=$( tail -n 7 "$DW_DATAWAVE_BUILD_STATUS_LOG" | grep "BUILD SUCCESS" )
  if [ -z "${success}" ] ; then
    return 1
  fi
  return 0
}

function buildDataWave() {
  if ! mavenIsInstalled ; then
    ! mavenInstall && error "Maven install failed. Please correct" && return 1
  fi

  [[ "$1" == "--verbose" ]] && local verbose=true

  ! setBuildPropertyOverrides && error "Aborting DataWave build" && return 1

  [ -f "${DW_DATAWAVE_BUILD_STATUS_LOG}" ] && rm -f "$DW_DATAWAVE_BUILD_STATUS_LOG"

  info "DataWave build in progress: '${DW_DATAWAVE_BUILD_COMMAND}'"
  info "Build status log: $DW_DATAWAVE_BUILD_STATUS_LOG"

  if [ "${verbose}" == true ] ; then
    ( cd "${DW_DATAWAVE_SOURCE_DIR}" && eval "${DW_DATAWAVE_BUILD_COMMAND}" 2>&1 | tee ${DW_DATAWAVE_BUILD_STATUS_LOG} )
  else
    ( cd "${DW_DATAWAVE_SOURCE_DIR}" && eval "${DW_DATAWAVE_BUILD_COMMAND}" &> ${DW_DATAWAVE_BUILD_STATUS_LOG} )
  fi

  if ! datawaveBuildSucceeded ; then
    error "The build has FAILED! See $DW_DATAWAVE_BUILD_STATUS_LOG for details"
    return 1
  fi

  info "DataWave build successful"
  return 0
}

function getDataWaveTarball() {
  # Looks for a DataWave tarball matching the specified pattern and, if found, sets the global 'tarball'
  # variable to its basename for the caller as expected.
  # If no tarball is found matching the specified pattern, then the DataWave build is kicked off
  local tarballPattern="${1}"
  tarball=""

  # Check if the tarball already exists in the plugin directory.
  local tarballPath="$( find "${DW_DATAWAVE_SERVICE_DIR}" -path "${tarballPattern}" -type f )"
  if [ -f "${tarballPath}" ]; then
    tarball="$( basename "${tarballPath}" )"
    return 0;
  fi

  ! buildDataWave --verbose && error "Please correct this issue before continuing" && return 1

  # Build succeeded. Set global 'tarball' variable for the specified pattern and copy all tarballs into place
  tarballPath="$( find "${DW_DATAWAVE_SOURCE_DIR}" -path "${tarballPattern}" -type f | tail -1 )"
  [ -z "${tarballPath}" ] && error "Failed to find '${tarballPattern}' tar file after build" && return 1
  tarball="$( basename "${tarballPath}" )"

  # Current caller (ie, either bootstrap-web.sh or bootstrap-ingest.sh) only cares about current $tarball,
  # but go ahead and copy both tarballs into datawave service dir to satisfy next caller as well
  ! copyDataWaveTarball "${DW_DATAWAVE_INGEST_TARBALL}" && error "Failed to copy DataWave Ingest tarball" && return 1
  ! copyDataWaveTarball "${DW_DATAWAVE_WEB_TARBALL}" && error "Failed to copy DataWave Web tarball" && return 1

  return 0
}

function copyDataWaveTarball() {
  local pattern="${1}"
  local dwTarball="$( find "${DW_DATAWAVE_SOURCE_DIR}" -path "${pattern}" -type f | tail -1 )";
  if [ -n "${dwTarball}" ] ; then
    ! cp "${dwTarball}" "${DW_DATAWAVE_SERVICE_DIR}" && error "Failed to copy '${dwTarball}'" && return 1
  else
    error "No tar file found matching '${pattern}'"
    return 1
  fi
  return 0
}

# Bootstrap DW ingest and webservice components as needed
source "${DW_DATAWAVE_SERVICE_DIR}/bootstrap-ingest.sh"
source "${DW_DATAWAVE_SERVICE_DIR}/bootstrap-web.sh"

function datawaveIsRunning() {
  datawaveIngestIsRunning && return 0
  datawaveWebIsRunning && return 0
  return 1
}

function datawaveStart() {
  datawaveIngestStart
  datawaveWebStart
}

function datawaveStop() {
  datawaveIngestStop
  datawaveWebStop
}

function datawaveStatus() {
  datawaveIngestStatus
  datawaveWebStatus
}

function datawaveIsInstalled() {
  datawaveIngestIsInstalled && return 0
  datawaveWebIsInstalled && return 0
  return 1
}

function datawaveUninstall() {
  datawaveIngestUninstall
  datawaveWebUninstall
  [[ "${1}" == "${DW_UNINSTALL_RM_BINARIES_FLAG_LONG}" || "${1}" == "${DW_UNINSTALL_RM_BINARIES_FLAG_SHORT}" ]] && rm -f "${DW_DATAWAVE_SERVICE_DIR}"/*.tar.gz
}

function datawaveInstall() {
  datawaveIngestInstall
  datawaveWebInstall
}

function datawavePrintenv() {
  echo
  echo "DataWave Environment"
  echo
  ( set -o posix ; set ) | grep -E "DATAWAVE_|WILDFLY|JBOSS"
  echo
}

function datawavePidList() {
  datawaveIngestIsRunning
  datawaveWebIsRunning
  if [[ -n "${DW_DATAWAVE_WEB_PID_LIST}" || -n "${DW_DATAWAVE_INGEST_PID_LIST}" ]] ; then
    echo "${DW_DATAWAVE_WEB_PID_LIST} ${DW_DATAWAVE_INGEST_PID_LIST}"
  fi
}

function datawaveBuildDeploy() {
  datawaveIsRunning && info "Stopping all DataWave services" && datawaveStop
  datawaveIsInstalled && info "Uninstalling DataWave" && datawaveUninstall --remove-binaries
  resetQuickstartEnvironment
  export DW_REDEPLOY_IN_PROGRESS=true
  datawaveInstall
  export DW_REDEPLOY_IN_PROGRESS=false
}

function datawaveBuild() {
  info "Building DataWave"
  rm -f "${DW_DATAWAVE_SERVICE_DIR}"/datawave*.tar.gz
  resetQuickstartEnvironment
}
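Once you've pasted the functions above into bootstrap_datawave.sh (shebang at the top, chmod u+x as mentioned earlier), a minimal way to load and poke at them looks like this. Keep in mind this is a sketch: several of the functions expect the rest of the quickstart environment (bootstrap-user.sh, bootstrap-ingest.sh, bootstrap-web.sh, and helpers like info/error) to be reachable from DW_DATAWAVE_SERVICE_DIR:

source ./bootstrap_datawave.sh            # pull the datawave* helper functions into the current shell
datawavePrintenv                          # dump the DataWave-related environment variables
datawaveIsInstalled && echo "installed"   # exit code tells you whether the ingest/web components are installed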

Let's try Datawave out finally

So let's find some Wikipedia data based on page title, with a --verbose flag to see the cURL command in action with Datawave:

datawaveQuery --query "PAGE_TITLE:AccessibleComputing OR PAGE_TITLE:Anarchism" --verbose 

Next, let's grab TV show data from api.tvmaze.com (graph edge queries):

datawaveQuery --logic EdgeQuery --syntax JEXL --query "SOURCE == 'kevin bacon' && TYPE == 'TV_COSTARS'" --pagesize 30 

Then, let's run another graph edge query:

datawaveQuery --logic EdgeQuery --syntax JEXL --query "SOURCE == 'william shatner' && TYPE == 'TV_CHARACTERS'" 

Let's try doing one more EdgeQuery:

datawaveQuery --logic EdgeQuery --syntax JEXL --query "SOURCE == 'westworld' && TYPE == 'TV_SHOW_CAST'" --pagesize 20 

That was cool, right? To formulate some of your own graph/edge queries, run the following:

datawaveQuery --help 

This will give you a broader scope on how Edges work.

Edges

One of the things I thought was, "EdgeQueryLogic in Datawave needs a better date range filter".


So to be clear, as it currently stands EdgeQueryLogic uses a column qualifier range filter to skip keys that are not within a specified date range. We need to be able to incorporate seeks so that the filter will skip over entries not within the date range. This is expected to be significantly faster when there are large gaps in the sequence of edges for a source value that fall outside the date range.

As you can see, this can come up short. An example subroutine that would help with Edge queries would look like this:

private boolean seekToStartKey(Key topKey, String date) throws IOException {
    boolean seeked = false;
    if (startDate != null && date.compareTo(startDate) < 0) {
        // Date is before start date. Seek to same key with date set to start date
        Key newKey = EdgeKeyUtil.getSeekToFutureKey(topKey, startDate);
        PartialKey pk = EdgeKeyUtil.getSeekToFuturePartialKey();
        fastReseek(newKey, pk);
    } else if (endDate != null && date.compareTo(endDate) > 0) {
        // Date is after end date. Seek past the remaining keys for this source entirely
        PartialKey part = EdgeKeyUtil.getSeekToNextKey();
        fastReseek(topKey.followingKey(part), part);
        seeked = true;
    }
    return seeked;
}

This is getting into the core Datawave architecture. I've now helped you install it, run some queries on Wikipedia data, and use a TV show API to demonstrate Edge querying.

Datawave Architecture

As you can imagine, with things like ZooKeeper and Hadoop in play you'll of course be using things like MapReduce and Pig. Below is, more or less, the architecture of Datawave:

[Diagram: DataWave architecture]

With this architecture, you might be thinking, "What about runaway queries?" Good question.

Your query can get into a state where the QueryIterator will not terminate. This is an edge case where yielding occurs and the FinalDocument is returned as a result, which puts us in an infinite loop: upon rebuilding the iterator, it will yield again and again.

Two things are required:

  • The yield callback is checked in the FinalDocumentIterator and the yield is passed through appropriately.
  • The underlying iterator is no longer checked once the final document is returned.

This is mainly another NB: it can happen, and if it does, there are ways to do some command-line jujitsu to get out of it.

Conclusion

We used Edge querying, pulled info from Wikipedia and gave it conditionals, and used some APIs via REST to do some cool things. These are just a very few of the things Datawave can do, and things I've personally done myself. I may do a part 2 of this series.
