Retrieve Daily Weather - ETL
You’ve been tasked by your team to create an automated Extract, Transform, Load (ETL) process to extract daily weather forecast and observed weather data and load it into a live report to be used for further analysis by the analytics team. As part of a larger prediction modelling project, the team wants to use the report to monitor and measure the historical accuracy of temperature forecasts by source and station.
As a proof-of-concept (POC), you are only required to do this for a single station and one source to begin with. For each day at noon (local time), you will gather both the actual temperature and the temperature forecasted for noon on the following day for Casablanca, Morocco.
At a later stage, the team anticipates extending the report to include lists of locations, different forecasting sources, different update frequencies, and other weather metrics such as wind speed and direction, precipitation, and visibility.
Data
We’ll use the weather data package provided by the open source project wttr.in, a web service that provides weather forecast information in a simple and text-based format. For further information, you can read more about the service on its GitHub Repo.
- First, you’ll use the
curl
command to scrape weather data via the wttr.in website. For example, to get data for Casablanca - You can actually go to their website wttr.in and if you don’t specify a location it will provide your local weather forecast, in the provided link I specifically targeted Casablanca
/casablanca curl wttr.in
- which prints the following to
stdout
Tasks
- Download raw weather data
- Extract data of interest from the raw data
- Transform the data as required
- Load the data into a log file using a tabular format
- Schedule the entire process to run automatically at a set time daily
Weather reporting tasks
You must extract and store the following data every day at noon, local time, for Casablanca, Morocco:
- The actual temperature (in degrees Celsius)
- The forecasted temperature (in degrees Celsius) for the following day at noon
Here is an example of what the the resulting weather report should look like:
year | month | day | obs_tmp | fc_temp |
---|---|---|---|---|
2023 | 1 | 1 | 10 | 11 |
2023 | 1 | 2 | 11 | 12 |
2023 | 1 | 3 | 12 | 10 |
2023 | 1 | 4 | 13 | 13 |
2023 | 1 | 5 | 10 | 9 |
2023 | 1 | 6 | 11 | 10 |
… | … | … | … | … |
Setup Output File
Create text file
This will be your POC weather report log file, simply a text file which contains a growing history of the daily weather data you will scrape.
Each entry in the log file corresponds to a row as in Table 1.
$ touch rx_poc.log
Add Header to Report
Your header should consist of the column names from Table 1, delimited by tabs.
- Write the header to your weather report.
- Using variables in such case makes for much cleaner code, which is easier to understand and safer to modify by others or even yourself at a later date.
- echo/print the header to the file
$ header=$(echo -e "year\tmonth\tday\thour\tobs_tmp\tfc_temp")
$ echo $header>rx_poc.log
# test the file header
$ cat rx_poc.log
year month day hour obs_tmp fc_temp
# check permissions
$ ls -l rx_poc.log
-rw-r--r-- 1 theia users 36 Sep 26 10:36 rx_poc.log
- Can also use a redirection
-e "year\tmonth\tday\thour\tobs_tmp\tfc_temp">rx_poc.log echo
Write Script File
Create a BASH Script file
- Make this an executable filecalled rx_poc.sh
$ touch rx_poc.sh
# check permissions
$ ls -l rx_poc.sh
-rw-r--r-- 1 theia users 0 Sep 26 10:39 rx_poc.sh
Set Permissions
+x rx_poc.sh
chmod u# check permissions
$ ls -l rx_poc.sh
-rwxr--r-- 1 theia users 0 Sep 26 10:39 rx_poc.sh
- make it a bash script (with the .sh)
#! /bin/bash
Download Weather Data
Tip: It’s good practice to keep a copy of the raw data you are using to extract the data you need.
- By appending a date or time stamp to the file name, you can ensure it’s name is unique.
- This builds a history of the weather forecasts which you can revisit at any time to recover from errors or expand the scope of your reports
- Using the prescribed date format ensures that when you sort the files, they will be sorted chronologically. It also enables searching for the report for any given date.
- If needed, you can compress and archive the files periodically. You can even automate this process by scheduling it.
Setup File
Download and save your report as a datestamped file named raw_data_<YYYMMDD>
# Setup file to download weather data for today into raw_data_<timestamp>
=$(date +%Y%m%d)
today# gives today the value of
$ echo $today
20240926
=raw_data_$today
weather_report# this sets the vale at
$ echo $weather_report
raw_data_20240926# So this is our filename for today
Download Casablanca Data
- Download the wttr.in weather report for Casablanca and save it to raw_data_<timestamp> above
# Download Casablanca data from wttr.in into the raw_data_<timestamp> file
=Casablanca
city/$city --output $weather_report
curl wttr.in
# CONFIRMATION
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 100 9087 100 9087 0 0 8500 0 0:00:01 0:00:01 --:--:-- 8100 9087 100 9087 0 0 8500 0 0:00:01 0:00:01 --:--:-- 8500
Extract Data
- Before extracting the data let’s look at it
- remember our file name above
$ cat raw_data_20240926
- Extract the required data from the raw file and assign them to variables obs_tmp and fc_temp the two columns in the table
- Extracting the required data is a process that will take some trial and error until you get it right.
Study the weather report you downloaded, and determine what you need to extract. Look for patterns.
You must find a way to ‘chip away’ at the weather report:- Use shell commands to extract only the data you need (the signal)
- Filter everything else out (the noise)
- Combine your filters into a pipeline (recall the use of pipes to chain filters together)
tr
Trim repeated characters to a single character.
$ echo "There are too many spaces in this sentence." | tr -s " "
in this sentence. There are too many spaces
xargs
xargs can be used to trim leading and trailing spaces from a string
$ echo " Never start or end a sentence with a space. " | xargs
Never start or end a sentence with a space.
rev
reverse the order of characters on a line of text
$ echo ".sdrawkcab saw ecnetnes sihT" | rev
This sentence was backwards.
cut
rev, and xargs are very useful when combined with the cut comand
# print the last field of the string
$ echo "three two one" | rev | cut -d " " -f 1 | rev
one
# Unfortunately, this prints the last field of the string, which is empty:
$ echo "three two one " | rev | cut -d " " -f 1 | rev
# But if you trim the trailing space first, you get the expected result:
$ echo "three two one " | xargs | rev | cut -d " " -f 1 | rev
one
Extract Data
- Extract only those lines that contain temperatures from the weather report, and write your result to file.
- Use
grep
and redirect the result to file. - Extract current temp and store it in a shell variable
obs_tmp
- which line is the current temp on
- Is there a character you can use as a delimiter to appropriately parse the line into fields?
- How about any leading or trailing white spaces?
- Extract tomorrow’s forecast for noon, and store it in sell variable
fc_tmp
- Follow the pipeline we just created for temp
# Extract the data from the raw data file
$weather_report > temperatures.txt
grep °C
# Extract current temp and store it in sell var obs_tmp
=$(head -1 temperatures.txt | tr -s " " | xargs | rev | cut -d " " -f2 | rev)
obs_tmp
# Extract tomorrow's forecast for noon and store in fc_tmp
=$(head -3 temperatures.txt | tail -1 | tr -s " " | xargs | cut -d "C" -f2 | rev | cut -d " " -f2 | rev) fc_temp
Store Timestamp
- Store the current hour, day, month, and year in variable
- Don’t set the time to 12 for noon because it is not the local time for the city we are extracting data for
- Use command substitution and the date command with the correct formatting options.
The time zone for Casablanca happens to be UTC+1. - To get the local time for Casablanca, you can set the time-zone environment variable, TZ, as follows:
TZ='Morocco/Casablanca'
# Store timestamp in its variable
=$(TZ='Morocco/Casablanca' date -u +%H)
hour=$(TZ='Morocco/Casablanca' date -u +%d)
day=$(TZ='Morocco/Casablanca' date +%m)
month=$(TZ='Morocco/Casablanca' date +%Y) year
Merge Fields into Tab Record
- Merge the fields into a tab-delimited record, corresponding to a single row in Table 1.
- Append the resulting record as a row of data to your weather log file.
- create a tab-delimited record
- recall the header was created as follows:
- header=$(echo -e “year\tmonth\tday\thour_UTC\tobs_tmp\tfc_temp”)
- echo $header>rx_poc.log
# Append the resulting record as a row into our log file
=$(echo -e "$year\t$month\t$day\t$hour\t$obs_tmp\t$fc_temp")
record$record>>rx_poc.log
echo
# make sure you save the script file - shown at end of page
Schedule Task
- Remember that we wanted to load the data at noon local time for subject city
- Check the time difference between your system’s default time and UTC
- Calculate the difference
# If date is my system time then date -u will be my date - UTC and we see a difference ofUTC+5, 5 hours ahead of mine
$ date
13 11:28:12 EST 2023
Mon Feb $ date -u
13 16:28:16 UTC 2023
Mon Feb
# we know from info given that Casablanca is UTC+1 so that makes it 5-1=4 hours before me
Edit Crontab
-e
crontab 0 8 * * * /home/project/rx_poc.sh
# save the file and exit editor
Script File
#! /bin/bash
# Create file to download weather data for today into raw_data_<timestamp>
=$(date +%Y%m%d)
today=raw_data_$today
weather_report
# Download Casablanca data from wttr.in into the raw_data_<timestamp> file
=Casablanca
city/$city --output $weather_report
curl wttr.in
# Extract the data from the raw data file
$weather_report > temperatures.txt
grep °C
# Extract current temp and store it in sell var obs_tmp
=$(head -1 temperatures.txt | tr -s " " | xargs | rev | cut -d " " -f2 | rev)
obs_tmp
# Extract tomorrow's forecast for noon and store in fc_tmp
=$(head -3 temperatures.txt | tail -1 | tr -s " " | xargs | cut -d "C" -f2 | rev | cut -d " " -f2 | rev)
fc_temp
# Store timestamp in its variable
=$(TZ='Morocco/Casablanca' date -u +%H)
hour=$(TZ='Morocco/Casablanca' date -u +%d)
day=$(TZ='Morocco/Casablanca' date +%m)
month=$(TZ='Morocco/Casablanca' date +%Y)
year
# Append the resulting record as a row into our log file
=$(echo -e "$year\t$month\t$day\t$hour\t$obs_tmp\t$fc_temp")
record$record>>rx_poc.log echo
Local Drive Script
This section will go through the same process but done on the local drive instead of the cloud as explained above
Change Directories
# ___ change directories
:\..> cd D:\data
PS C
# ___ create directory
:\data> mkdir Linux_projects
PS D:\data> cd Linux_projects
PS D
# ___ create file to contain output
:\data\Linux_projects> New-Item rx_poc.log
PS D# ___ add header to file (column names)