Import Tarball and Save in PostgreSQL
Task 1
Import the tarball, transform the data, and load it into the table ‘access_log’ in the PostgreSQL database ‘template1’.
- The file is available at the location: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/web-server-access-log.txt.gz
- The following are the columns and their data types in the file:
- a. timestamp - TIMESTAMP
- b. latitude - float
- c. longitude - float
- d. visitorid - char(37)
- e. accessed_from_mobile - boolean
- f. browser_code - int
- The columns which we need to copy to the table are the first four columns: timestamp, latitude, longitude and visitorid (an illustrative layout is shown after the note below).
NOTE: The file comes with a header. So use the ‘HEADER’ option in the ‘COPY’ command.
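For orientation: the file’s columns are ‘#’-delimited (see Task 6), so its header row takes roughly the form below. This line is shown only to illustrate the layout; the exact header text comes from the file itself.
timestamp#latitude#longitude#visitorid#accessed_from_mobile#browser_code
Only the first four of these fields are copied into the table.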
Task 2
The problem may be solved by completing the following tasks:
- Go to the SkillsNetwork Tools menu and start the PostgreSQL server if it is not already running.
- Create a table named access_log to store the timestamp, latitude, longitude and visitorid.
Step 1:
Open the PostgreSQL CLI, if it is not already open.
Step 2:
At the postgres=# prompt, run the following command to connect to the database ‘template1’.
\c template1;
Step 3:
Once you connect to the database, run the following command to create the table called ‘access_log’:
CREATE TABLE access_log(timestamp TIMESTAMP, latitude float, longitude float, visitor_id char(37));
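Optionally, you can confirm the table exists before moving on. This check is not part of the lab steps, just a sanity check: at the psql prompt, the \d meta-command lists the table’s columns.
\d access_log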
Task 3
- Create a shell script named cp-access-log.sh and add commands to it to complete the remaining tasks that extract and copy the data to the database.
- Run the following command in a new terminal to create the file cp-access-log.sh.
touch cp-access-log.sh
- Open the file in the editor and add appropriate comments to the script.
- Copy and paste the following lines into the newly created file and save it.
# cp-access-log.sh
# This script downloads the file 'web-server-access-log.txt.gz'
# from "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/".
# The script then extracts the .txt file using gunzip.
# The .txt file contains the timestamp, latitude, longitude
# and visitor id apart from other data.
# Transforms the text delimiter from "#" to "," and saves to a csv file.
# Loads the data from the CSV file into the table 'access_log' in PostgreSQL database.
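Note: the lab runs the script with bash cp-access-log.sh, so a shebang line is not strictly required. If you prefer to make the script directly executable, you could optionally add a first line such as
#!/bin/bash
to the script and then run
chmod +x cp-access-log.sh
in the terminal.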
Task 4
- Download the access log file.
- Add the wget command to the script to download the file.
wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/web-server-access-log.txt.gz"
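Optionally, after the script has been run you can confirm the download from the terminal:
ls -l web-server-access-log.txt.gz
If wget ever finds a file of the same name already present, it saves the new download under a numbered name such as web-server-access-log.txt.gz.1 rather than overwriting it; this should not normally happen here because the gunzip step added next removes the archive.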
Task 5
- Unzip the gzip file.
- Add the gunzip command to the script to unzip the .gz file and extract the .txt file.
# Unzip the file to extract the .txt file.
gunzip -f web-server-access-log.txt.gz
The -f option of gunzip overwrites the output file if it already exists.
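By default gunzip deletes the .gz archive after extraction, leaving web-server-access-log.txt in its place. To optionally preview the first few raw records after the script has been run:
head -5 web-server-access-log.txt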
Task 6
- Extract the required fields from the file.
- Extract timestamp, latitude, longitude and visitorid, which are the first four fields in the file, using the cut command (a quick illustration of cut follows this task's steps).
- The columns in the web-server-access-log.txt file are delimited by ‘#’.
- Copy the following lines and add them to the end of the script.
# Extract phase
"Extracting data"
echo
# Extract the columns 1 (timestamp), 2 (latitude), 3 (longitude) and
# 4 (visitorid)
-d"#" -f1-4 web-server-access-log.txt cut
- Save the file.
- Run the script.
bash cp-access-log.sh
- Verify that the output contains all four fields that we extracted.
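If the cut options are unfamiliar, you can try them on a throwaway ‘#’-delimited string in the terminal (purely illustrative, not real data): -d sets the delimiter and -f selects the fields to keep.
echo "a#b#c#d#e#f" | cut -d"#" -f1-4
This prints a#b#c#d. Note that cut keeps the original ‘#’ delimiter in its output, which is why the transform step in Task 8 is needed to produce a comma-separated file.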
Task 7
- Redirect the extracted output into a file.
- Redirect the extracted data into a file named extracted-data.txt.
- Replace the cut command at the end of the script with the following command and save the file.
cut -d"#" -f1-4 web-server-access-log.txt > extracted-data.txt
- Run the script.
bash cp-access-log.sh
- Run the command below to verify that the file extracted-data.txt is created and has the content.
cat extracted-data.txt
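Optionally, you can also count the lines in the extracted file (the data rows plus one header line):
wc -l extracted-data.txt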
Task 8
- Transform the data into CSV format.
- The extracted columns are separated by the original “#” delimiter.
- We need to convert this into a “,” delimited file.
- Add the lines below at the end of the script
# Transform phase
"Transforming data"
echo
# read the extracted data, replace the '#' delimiter with commas, and
# write it to a csv file
tr "#" "," < extracted-data.txt > transformed-data.csv
- Save the file.
- Run the script.
bash cp-access-log.sh
- Run the command below to verify that the file ‘transformed-data.csv’ is created, and has the content.
cat transformed-data.csv
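Note that tr performs a character-for-character translation, replacing every ‘#’ with ‘,’; this works here on the assumption that none of the four retained fields contains a literal ‘#’. Optionally, preview the first few rows of the result:
head -5 transformed-data.csv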
Task 9
- Load the data into the table access_log in PostgreSQL.
- The PostgreSQL command to copy data from a CSV file to a table is COPY.
- The basic structure of the command is:
COPY table_name FROM 'filename' DELIMITERS 'delimiter_character' FORMAT;
- The file comes with a header. So use the ‘HEADER’ option in the ‘COPY’ command.
- Invoke this command from the shell script by sending it as input to the ‘psql’ command through a pipe (see the note after the code below).
- Add the lines below to the end of the script and save the file
# Load phase
"Loading data"
echo
# Send the instructions to connect to 'template1' and
# copy the file to the table 'access_log' through command pipeline.
"\c template1;\COPY access_log FROM '/home/project/transformed-data.csv' DELIMITERS ',' CSV HEADER;" | psql --username=postgres --host=localhost echo
Task 10
- Execute the final script.
- Run the following command at the terminal.
bash cp-access-log.sh
- The completed script is here below
# cp-access-log.sh
# This script downloads the file 'web-server-access-log.txt.gz'
# from "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/".
# The script then extracts the .txt file using gunzip.
# The .txt file contains the timestamp, latitude, longitude
# and visitor id apart from other data.
# Transforms the text delimiter from "#" to "," and saves to a csv file.
# Loads the data from the CSV file into the table 'access_log' in PostgreSQL database.
# Download the access log file
wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Bash%20Scripting/ETL%20using%20shell%20scripting/web-server-access-log.txt.gz"
# Unzip the file to extract the .txt file.
gunzip -f web-server-access-log.txt.gz
# Extract phase
"Extracting data"
echo
# Extract the columns 1 (timestamp), 2 (latitude), 3 (longitude) and
# 4 (visitorid)
-d"#" -f1-4 web-server-access-log.txt > extracted-data.txt
cut
# Transform phase
"Transforming data"
echo
# read the extracted data and replace the colons with commas.
"#" "," < extracted-data.txt > transformed-data.csv
tr
# Load phase
"Loading data"
echo
# Send the instructions to connect to 'template1' and
# copy the file to the table 'access_log' through command pipeline.
"\c template1;\COPY access_log FROM '/home/project/transformed-data.csv' DELIMITERS ',' CSV HEADER;" | psql --username=postgres --host=localhost echo
Task 11
- Verify by querying the database.
- Run the command below at the PostgreSQL CLI prompt.
SELECT * FROM access_log;
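Optionally, you can run a couple of additional checks at the same prompt, for example counting the loaded rows or viewing only the first few:
SELECT count(*) FROM access_log;
SELECT * FROM access_log LIMIT 5;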