Import a Tarball and Save It in PostgreSQL

Task 1

Import the tarball, transform the data, and load it into the table ‘access_log’ in the PostgreSQL database ‘template1’.

  • The file is available at the following location: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/web-server-access-log.txt.gz
  • The following are the columns and their data types in the file:
  • a. timestamp - TIMESTAMP
  • b. latitude - float
  • c. longitude - float
  • d. visitorid - char(37)
  • e. accessed_from_mobile - boolean
  • f. browser_code - int
  • The columns which we need to copy to the table are the first four columns: timestamp, latitude, longitude and visitorid.

NOTE: The file comes with a header. So use the ‘HEADER’ option in the ‘COPY’ command.

Task 2

The problem may be solved by completing the following tasks:

  • Go to the SkillsNetwork Tools menu and start the PostgreSQL server if it is not already running.
  • Create a table named access_log to store the timestamp, latitude, longitude and visitorid.

Step 1:

Open the PostgreSQL CLI, if it is not already open.

Step 2:

At the postgres=# prompt, run the following command to connect to the database ‘template1’.

\c template1;

Step 3:

Once you connect to the database, run the command to create the table called ‘access_log’:

CREATE TABLE access_log(timestamp TIMESTAMP, latitude float, longitude float, visitor_id char(37));
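
As a quick optional check (not part of the lab steps), you can describe the new table at the psql prompt to confirm it was created with the expected columns:

\d access_log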

Task 3

  • Create a shell script named cp-access-log.sh and add commands to it to complete the remaining tasks: extract the data and copy it to the database.
  • Run the following command in a new terminal to create the file cp-access-log.sh.
touch cp-access-log.sh
  • Open the file in the editor and add appropriate comments to the script.
  • Copy and paste the following lines into the newly created file and save it.
# cp-access-log.sh
# This script downloads the file 'web-server-access-log.txt.gz'
# from "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/".

# The script then extracts the .txt file using gunzip.

# The .txt file contains the timestamp, latitude, longitude 
# and visitor id apart from other data.

# Transforms the text delimiter from "#" to "," and saves to a csv file.
# Loads the data from the CSV file into the table 'access_log' in PostgreSQL database.
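
Optionally, if you plan to make the script executable later and run it directly (rather than invoking it with bash), you can also add a shebang line at the very top of the file. This is an optional addition, not required by the lab:

#!/bin/bash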

Task 4

  • Download the access log file.
  • Add the wget command to the script to download the file.
wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/
labs/Bash%20Scripting/ETL%20using%20shell%20scripting/web-server-access-log.txt.gz"
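
After this step runs, you can optionally confirm from the terminal that the file was downloaded (a quick sanity check, not part of the script):

ls -l web-server-access-log.txt.gz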

Task 5

  • Unzip the gzip file
  • Add the following code to the script to run the gunzip command, which unzips the .gz file and extracts the .txt file.
# Unzip the file to extract the .txt file.
gunzip -f web-server-access-log.txt.gz

The -f option tells gunzip to overwrite the output file if it already exists.
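
Before writing the extract step, you can optionally preview a few lines of the extracted file at the terminal; the fields should appear separated by the '#' delimiter:

head -5 web-server-access-log.txt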

Task 6

  • Extract required fields from the file.
  • Extract timestamp, latitude, longitude and visitorid, which are the first four fields in the file, using the cut command.
  • The columns in the web-server-access-log.txt file are delimited by ‘#’.
  • Copy the following lines and add them to the end of the script.
# Extract phase

echo "Extracting data"

# Extract the columns 1 (timestamp), 2 (latitude), 3 (longitude) and 
# 4 (visitorid)

cut -d"#" -f1-4 web-server-access-log.txt
  • Save the file
  • Run the script
bash cp-access-log.sh
  • Verify that the output contains all four fields that we extracted (an optional quick check is shown below).
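
If the full output is too long to inspect, an optional way to spot-check the extracted fields is to pipe the same cut command through head (run this at the terminal; it is not part of the script):

cut -d"#" -f1-4 web-server-access-log.txt | head -5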

Task 7

  • Redirect the extracted output into a file.
  • Redirect the extracted data into a file named extracted-data.txt
  • Replace the cut command at the end of the script with the following command and save the file.
cut -d"#" -f1-4 web-server-access-log.txt > extracted-data.txt
  • Run the script
bash cp-access-log.sh
  • Run the command below to verify that the file extracted-data.txt is created and has the content.
cat extracted-data.txt
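
Optionally, you can also confirm that every line of the source file made it into extracted-data.txt by comparing line counts (a sanity check, not part of the script):

wc -l web-server-access-log.txt extracted-data.txt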

Task 8

  • Transform the data into CSV format.
  • The extracted columns are separated by the original “#” delimiter.
  • We need to convert this into a “,” delimited file.
  • Add the lines below at the end of the script
# Transform phase
echo "Transforming data"

# read the extracted data and replace the '#' characters with commas and
# write it to a csv file
tr "#" "," < extracted-data.txt > transformed-data.csv
  • Save the file
  • Run the script
bash cp-access-log.sh
  • Run the command below to verify that the file ‘transformed-data.csv’ is created and has the content.
cat transformed-data.csv
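
As an optional check, you can confirm that all the '#' delimiters were replaced. The command below counts the lines in transformed-data.csv that still contain a '#' and should print 0 if the transform succeeded (not part of the script):

grep -c "#" transformed-data.csv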

Task 9

  • Load the data into the table access_log in PostgreSQL
  • The PostgreSQL command for copying data from a CSV file into a table is COPY.
  • The basic structure of the command is:
COPY table_name FROM 'filename' DELIMITERS 'delimiter_character' FORMAT;
  • The file comes with a header. So use the ‘HEADER’ option in the ‘COPY’ command.
  • Invoke this command from the shell script by piping it as input to the ‘psql’ command.
  • Add the lines below to the end of the script and save the file
# Load phase
echo "Loading data"

# Send the instructions to connect to 'template1' and
# copy the file to the table 'access_log' through command pipeline.

echo "\c template1;\COPY access_log  FROM '/home/project/transformed-data.csv' DELIMITERS ',' CSV HEADER;" | psql --username=postgres --host=localhost

Task 10

  • Execute the final script by running the following command at the terminal:
bash cp-access-log.sh
  • The completed script is shown below.
# cp-access-log.sh
# This script downloads the file 'web-server-access-log.txt.gz'
# from "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/".

# The script then extracts the .txt file using gunzip.

# The .txt file contains the timestamp, latitude, longitude 
# and visitor id apart from other data.

# Transforms the text delimiter from "#" to "," and saves to a csv file.
# Loads the data from the CSV file into the table 'access_log' in PostgreSQL database.
# Download the access log file

wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/web-server-access-log.txt.gz"
# Unzip the file to extract the .txt file.
gunzip -f web-server-access-log.txt.gz

# Extract phase

echo "Extracting data"

# Extract the columns 1 (timestamp), 2 (latitude), 3 (longitude) and 
# 4 (visitorid)

cut -d"#" -f1-4 web-server-access-log.txt > extracted-data.txt

# Transform phase
echo "Transforming data"

# read the extracted data and replace the '#' characters with commas.
tr "#" "," < extracted-data.txt > transformed-data.csv

# Load phase
echo "Loading data"

# Send the instructions to connect to 'template1' and
# copy the file to the table 'access_log' through command pipeline.

echo "\c template1;\COPY access_log  FROM '/home/project/transformed-data.csv' DELIMITERS ',' CSV HEADER;" | psql --username=postgres --host=localhost

Task 11

  • Verify by querying the database
  • Run the command below at the PostgreSQL CLI prompt.
SELECT * from access_log;
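
If the table contains many rows, you can optionally check just the row count or the first few rows instead:

SELECT COUNT(*) FROM access_log;
SELECT * FROM access_log LIMIT 10;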