File Manipulation
Handle Files
wget
The wget command downloads a file from the provided URL; the example below downloads a text file called dingo.txt. You’ll see this command again later in the context of networking commands. You can check that dingo.txt downloaded successfully by using the ls command.
$ wget https://www.yourdataiq.com/TT/Data/Blah/dingo.txt
wget is similar to curl: it is used to retrieve files located at a URL, or the HTML code for a webpage
- It is more convenient than curl when a webpage links to several files, because it supports recursive file downloads
- It provides progress information while it is downloading
- -P sets the directory in which the downloaded files are saved
# We are using Linux in WSL, so we have to add /mnt at the start of the path to access files on the Windows system
# Change to directory
~$ cd /mnt/d/data/Linux_projects/Final_projects
~$ ls
C6_M4
# Create new directory within it
~$ mkdir C7
~$ ls
C6_M4 C7
# Download the script to the new directory
$ wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0231EN-SkillsNetwork/datasets/World/world_mysql_script.sql -P C7
# OUTPUT
--2024-09-30 16:52:31-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0231EN-SkillsNetwork/datasets/World/world_mysql_script.sql
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 389702 (381K) [application/x-sql]
Saving to: ‘C7/world_mysql_script.sql’
world_mysql_script.sql 100%[=================================================>] 380.57K 1.26MB/s in 0.3s
2024-09-30 16:52:33 (1.26 MB/s) - ‘C7/world_mysql_script.sql’ saved [389702/389702]
curl
curl is a command-line utility for transferring data from or to a server, designed to work without user interaction. With curl, you can download or upload data using one of the supported protocols, including HTTP, HTTPS, SCP, SFTP, and FTP.
curl provides a number of options that let you resume transfers, limit the bandwidth, use a proxy, authenticate as a user, and much more.
- Syntax: curl [options] [URL...]
- If you want to extract the source code for a web page, just use curl blah.com and it will print the source code
- If no protocol is specified, curl will guess it and default to HTTP
- -o (lower case) saves the output to a filename: curl -o filename https://blah.com/santa/whogoes/there.js
- -O (upper case) saves the file with its original filename
- Multiple files can be downloaded by using -O (upper case) paired with each origin location
- -C is used to resume a download that was interrupted
- -I fetches the HTTP headers only
- -L follows redirects
- -A changes the User-Agent
- --limit-rate allows you to limit the data transfer rate, in bytes, kilobytes (k), megabytes (m), and gigabytes (g)
- -u is used to access a protected FTP server with a username and password
- -T is used to upload a file to the FTP server
# ___ Download multiple files by pairing -O with each URL
curl -O http://~/archlinux-2018.06.01-x86_64.iso \
     -O https://~/debian-9.4.0-amd64-netinst.iso
# ___ If your connection drops you can resume the download with -C - (the lone - lets curl work out where to resume)
curl -C - -O https://santa/blah.tz
# ___ Fetch the HTTP headers only
curl -I --http2 https://www.santa.claus.com/
# ___ Access a protected FTP with Us & Pw
curl -u FTP_USERNAME:FTP_PASSWORD ftp://ftp.santa.claus.com/
# ___ You can download a single file with
curl -u FTP_USERNAME:FTP_PASSWORD ftp://ftp.santa.claus.com/whereismyfile.tar.gz
# ___ Upload file to FTP server
curl -T thisfile.tar.gz -u FTP_USERNAME:FTP_PASSWORD ftp://ftp.santa.claus.com/
# ___ Extract the entire landing page use
curl www.google.com
# ___ Extract and save the content into a new file
curl www.google.com -o file1.txt
# ___ View the first line of the saved file with head
head -n 1 file1.txt
cut
cut is used to extract one or more columns of data. It can be used with CSV or tabular text data files
- -f cuts input by column or field
- -d , (uses , as field delimiter)
- cut -f 1,3,5 -d, filename (extracts columns 1, 3, and 5 separated by ,)
- cut -f 5-25,32,69-96 -d, filename (extracts ranges and columns)
- cut -d'|' -f 5 filename (extracts from a pipe-delimited file); it cuts the line at each | and returns the 5th field
- cut -b -12 filename (-b cuts by byte position instead of field; extracts the first 12 bytes)
- cut -c 2-7 filename (-c cuts by character, from character 2 to character 7 in each line)
- cut -f -d <> filename | cut -f -d<> (used to extract the object between the two <delimiters>)
# --- Let's say a web server log looks like this:
127.0.0.1 - john [10/Oct/2019:13:55:36 -0700] "GET /home.html HTTP/1.0" 200 2326
# ___ Extract the request timestamps |> [timestamp]
cut -d'[' -f2 filename | cut -d']' -f1
# ___ Extract column 5-7 with tab delimiter
cut -f5-7 -d$'\t' filename --output-delimiter=","
# ___ Can do this instead: since tab is the default for -d we can omit it
cut -f5-7 filename --output-delimiter=","
# ___ Cut columns 5-7, then convert the tab delimiter to something else and save the result elsewhere
cut -f5-7 filename | tr '\t' ',' > destinationfilename
# ___ To extract the last name from a first and last name line separated by " "=delimiter
cut -d ' ' -f2 file1.txt
# It will separate the line at the space and extract the 2nd field
Code | Description |
---|---|
-f | cut by column or field |
-d <delimiter> | set the <delimiter> |
-b | cut by byte |
-c | cut by character |
--complement | cut except |
-d <> filename \| -d <> | cut object between the two <delimiters> |
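As a worked illustration of the -d, -f, and -c options listed above, the sketch below builds a small comma-separated file and runs cut against it. The file name people.csv, its contents, and the temporary directory are made up for the example; the log line is the one quoted earlier in this section.

```shell
# Work in a throwaway directory (hypothetical sample data, not a real dataset)
tmp=$(mktemp -d)
printf 'id,name,city\n1,Alan,Oslo\n2,Bob,Lima\n' > "$tmp/people.csv"

# -d sets the delimiter, -f picks the field: extract the second column
names=$(cut -d, -f2 "$tmp/people.csv")
echo "$names"

# Several fields at once: columns 1 and 3, still comma-separated
ids_and_cities=$(cut -d, -f1,3 "$tmp/people.csv")
echo "$ids_and_cities"

# -c cuts by character position: the first 2 characters of every line
first_two=$(cut -c 1-2 "$tmp/people.csv")
echo "$first_two"

# Two cuts chained to pull out the text between [ and ] in a log line
log_line='127.0.0.1 - john [10/Oct/2019:13:55:36 -0700] "GET /home.html HTTP/1.0" 200 2326'
timestamp=$(printf '%s\n' "$log_line" | cut -d'[' -f2 | cut -d']' -f1)
echo "$timestamp"
```

The first cut keeps everything after the opening bracket and the second keeps everything before the closing bracket, which is why the pair isolates the timestamp.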
cut --complement
You can invert the cut with cut --complement: instead of extracting the column, you exclude it from extraction. Take everything but.
- cut --complement -f 2,4 filename (takes all fields except 2 and 4)
- cut --complement -b -12 filename (takes all bytes except the first 12)
- we can cut the first 10, middle 10, and last 10 characters on each line using (if we know the max length is 32)
cut -c 1-10,12-21,23-32 data.txt
grep
grep (global regular expression print) is used to search text for patterns; it returns the lines in a file that match a pattern
- If you want to search for the pattern “ch” use grep ch filename
- If you want to include a mixture of upper and lower case matches use -i to make it case insensitive
# ___ Here we search the output of airflow dags list for "string" with | grep
airflow dags list | grep "Bash_ETL_Server_Log_Processing"
# To find lines that contain the pattern "ch"
grep ch file1.txt
# To find lines that contain case insensitive "ch"
grep -i ch file1.txt
# To print lines which DO NOT contain a pattern use -v
grep -v login /etc/passwd # prints all lines not containing login
Some frequently used options for grep include:
Option | Description |
---|---|
-n | Along with the matching lines, also print the line numbers |
-c | Get the count of matching lines |
-i | Ignore the case of the text while matching |
-v | Print all lines which do not contain the pattern |
-w | Match only if the pattern matches whole words |
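The options in the table can be tried on a tiny word list. This is only a sketch: the file words.txt and its five words are invented for the illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' Cheese chair teach book ch > "$tmp/words.txt"

# -c: count the lines containing "ch" (case-sensitive, so "Cheese" is not counted)
count=$(grep -c ch "$tmp/words.txt")
# -i added: case is ignored, so "Cheese" now matches too
count_ci=$(grep -ci ch "$tmp/words.txt")
# -n: prefix each matching line with its line number
numbered=$(grep -n book "$tmp/words.txt")
# -v: keep only the lines that do NOT contain the pattern
inverted=$(grep -v ch "$tmp/words.txt")
# -w: match "ch" only when it stands alone as a whole word
whole_word=$(grep -w ch "$tmp/words.txt")

printf 'count=%s count_ci=%s numbered=%s whole_word=%s\n' \
  "$count" "$count_ci" "$numbered" "$whole_word"
echo "$inverted"
```

Note how -w rejects chair and teach even though both contain ch, because there the pattern is only part of a longer word.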
touch
touch is used to create a new file, and can also be used for changing the timestamps on files and directories. You can create as many files as you want with a single command
- Common options: -a -m -r -d
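A minimal sketch of these options follows, assuming GNU coreutils touch (the date string accepted by -d varies between implementations). The file names a.txt, b.txt, and c.txt and the chosen date are placeholders.

```shell
tmp=$(mktemp -d)

# Create several empty files with one command
touch "$tmp/a.txt" "$tmp/b.txt" "$tmp/c.txt"

# -d sets the timestamp to a given date instead of "now"
touch -d '2024-01-01 12:00:00' "$tmp/a.txt"

# -r copies the timestamps of a reference file (a.txt) onto b.txt
touch -r "$tmp/a.txt" "$tmp/b.txt"

# -a changes only the access time, -m only the modification time
touch -a "$tmp/c.txt"
touch -m "$tmp/c.txt"

# The modification times are visible in the long listing
ls -l "$tmp"
```

After this runs, a.txt and b.txt share the 2024 timestamp while c.txt carries the current time.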
cat
cat can be used to display, create, or combine copies of files, as well as print the entire content of a file
- create
- read
- concatenate files: it attaches one file to the end of another. In other words, it will vertically merge files (similar to rbind)
- Common options:
- -b number non-blank output lines
- -n number all output lines
- -s squeeze multiple adjacent blank lines
- -v display nonprinting characters, except for tabs and the end-of-line characters
# ___ Print entire content of file
$ cat file1.txt
# ___ Concatenate two files
$ cat file1.txt file2.txt
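The three uses of cat (create, read, concatenate) and a few of its options can be sketched as follows. The file contents and the combined.txt name are assumptions made for the example.

```shell
tmp=$(mktemp -d)

# Create: redirect cat's output into a new file (a here-document supplies the text)
cat > "$tmp/file1.txt" <<'EOF'
alpha


beta
EOF
printf 'gamma\n' > "$tmp/file2.txt"

# Read: print the file, numbering every line (-n) or only non-blank lines (-b)
cat -n "$tmp/file1.txt"
cat -b "$tmp/file1.txt"

# -s squeezes the two adjacent blank lines into one
squeezed_lines=$(cat -s "$tmp/file1.txt" | wc -l | tr -d ' ')

# Concatenate: file2 is appended after file1 (a vertical merge)
cat "$tmp/file1.txt" "$tmp/file2.txt" > "$tmp/combined.txt"
combined_lines=$(wc -l < "$tmp/combined.txt" | tr -d ' ')
echo "squeezed=$squeezed_lines combined=$combined_lines"
```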
more
- Prints the contents of a file one page at a time, as opposed to cat, which prints the entire content at once so that you only see what fits in your terminal window.
- Hitting the space bar will scroll to the next page
$ more file1.txt
less
Just like more, the less command displays the first page of the file. What’s useful about less is that you can use it to move around the file, page by page, using the Page Up and Page Down keys.
You can also scroll up and down through the file line-by-line, using the Up Arrow and Down Arrow keys, ↑ and ↓.
Unlike more, less does not automatically exit when you reach the end of a file, allowing you the option to continue scrolling around. You can quit at any time by typing q.
sort
- To sort the lines of a file in ascending alphabetical order: sort file1.txt
- To sort in descending order, include -r like this: sort -r file1.txt
- Suppose you have the file pets.txt
$ cat pets.txt
goldfish
dog
cat
parrot
dog
goldfish
goldfish
$ sort pets.txt
cat
dog
dog
goldfish
goldfish
goldfish
parrot
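The same idea as a runnable sketch: it writes its own copy of a pets.txt file into a temporary directory (the seven-line contents are assumed for the illustration) and sorts it both ways. sort only prints the reordered lines; the file itself is left unchanged.

```shell
tmp=$(mktemp -d)
printf '%s\n' goldfish dog cat parrot dog goldfish goldfish > "$tmp/pets.txt"

# Ascending alphabetical order (the default)
ascending=$(sort "$tmp/pets.txt")
# -r reverses the order
descending=$(sort -r "$tmp/pets.txt")

echo "$ascending"
echo "---"
echo "$descending"
```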
uniq
- To filter out repeated lines and display unique lines use: uniq file1.txt
- It removes repeated lines only if they are consecutive
- So in reality the result is not a list of unique lines; be careful how and why you use it
- The uniq command drops any lines in the file that are identical and consecutive. This is similar to what is known as “dropping duplicates”, but duplicated lines can still be left over if they are not repeated right after one another.
$ uniq pets.txt
goldfish
dog
cat
parrot
dog
goldfish
# Combine sort & uniq
$ sort pets.txt | uniq
cat
dog
goldfish
parrot
paste (default)
Merges several input files to produce a new delimited text file. The default mode is parallel with a TAB delimiter, which is similar to cbind in R: each input file’s data serves as a column in the output. It is often useful to provide it with a delimiter.
-d sets a delimiter other than the default TAB
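To illustrate the default (parallel) mode of paste described above, here is a small sketch. The two input files first.txt and last.txt and their contents are invented for the example.

```shell
tmp=$(mktemp -d)
printf '%s\n' Alan Bob Stinky > "$tmp/first.txt"
printf '%s\n' Blah Dingo Wingo > "$tmp/last.txt"

# Default: line N of each file becomes one output row, joined by a TAB
tab_joined=$(paste "$tmp/first.txt" "$tmp/last.txt")
echo "$tab_joined"

# -d replaces the TAB with another delimiter, here a comma
comma_joined=$(paste -d, "$tmp/first.txt" "$tmp/last.txt")
echo "$comma_joined"
```

Each input file ends up as one column, so two 3-line files give a 2-column by 3-row result.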
paste (serial)
What’s different from the default action is this: let’s say we have two files, each with one column and 5 lines/rows of data. Default/parallel mode pastes them into a 2-column by 5-row output (cbind). Serial mode takes the 5 lines/rows of file1 and turns them into 5 columns on the first line of the output, then takes the 5 lines/rows of file2 and places them on a second line, after the one created from file1. So now we’ll have a 5-column by 2-row output.
-s serial mode
NOTE: In effect, if we want to join all the lines of a file into a single line we can use paste -s, because it takes all the lines of the first file and joins them together into the first line of the output, then takes all the lines of the second file and joins them into the second line of the output.
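Serial mode can be sketched with two small files. The names first.txt and last.txt and their three lines each are assumptions for the illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' Alan Bob Stinky > "$tmp/first.txt"
printf '%s\n' Blah Dingo Wingo > "$tmp/last.txt"

# -s: all lines of first.txt form output line 1, all lines of last.txt form line 2
serial=$(paste -s "$tmp/first.txt" "$tmp/last.txt")
echo "$serial"

# -s combined with -d joins the lines of a single file with a comma
one_line=$(paste -s -d, "$tmp/first.txt")
echo "$one_line"
```

Here two 3-line files become a 3-column by 2-row output, the transpose of what the default mode produces.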
# ___ Paste files together and leave as default (which is tab delimiter)
paste file1.txt file2.txt file3.txt
# OUTPUT
Alan Blah 2345
Bob Dingo 9083
Stinky Wingo 8374
# ___ Paste three files and save result to new file, set delimiter to ,
paste -d "," csv_data.csv tsv_data.csv fixed_width_data.csv > extracted_data.csv
# ___ Paste two files and look at the first 3 lines, set delimiter to ;
paste -d ';' file1.csv file2.csv | head -3
head
The head command displays the first 10 lines of a file. You can also set the number of lines you wish to view by using the -n option
- -n N prints out the first N lines of the file(s)
- -q doesn’t print out the file-name headers
- -v always prints out the file-name headers
# ___ Print out the first 10 lines
head filename
# ___ Print out the first 7 lines
head -n 7 filename
# ___ Print out the first 5 lines of file1 followed by the first 5 lines of file2, EXCLUDING the file-name headers
head -q -n 5 file1 file2
tail
Similar to head except it displays the last 10 lines of a file
- -n N prints out the last N lines of the file(s)
head -n 3 file1.txt # for first 3 lines
tail -n 3 file1.txt # for last 3 lines
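head and tail are often combined to pull a slice out of the middle of a file. The sketch below assumes a hypothetical numbers.txt holding the integers 1 to 20, one per line.

```shell
tmp=$(mktemp -d)
seq 1 20 > "$tmp/numbers.txt"

# First and last three lines
first_three=$(head -n 3 "$tmp/numbers.txt")
last_three=$(tail -n 3 "$tmp/numbers.txt")

# Lines 10-12: keep the first 12 lines, then the last 3 of those
middle=$(head -n 12 "$tmp/numbers.txt" | tail -n 3)

echo "$first_three"
echo "$last_three"
echo "$middle"
```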
wc
word count: counts the lines, words, and bytes in your file
- to view only line count use: -l
- to view only word count use: -w
- to view only byte count use: -c (for plain ASCII text this equals the character count; use -m to count characters)
$ cat pets.txt
# OUTPUT
cat
cat
cat
cat
dog
dog
cat
$ wc pets.txt
# OUTPUT
7 7 28 pets.txt
# this means your file contains:
# 7 lines, 7 words, 28 characters (newline characters are counted)
# lines only
$ wc -l pets.txt
# words only
$ wc -w pets.txt
# chars only
$ wc -c pets.txt
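When the counts are needed inside a script, it helps to read the file from standard input so that wc prints only the number. A small sketch, recreating the seven-line pets.txt from the example above in a temporary directory:

```shell
tmp=$(mktemp -d)
printf '%s\n' cat cat cat cat dog dog cat > "$tmp/pets.txt"

# Reading from standard input (<) makes wc print the number without the file name;
# tr strips the padding some implementations add
lines=$(wc -l < "$tmp/pets.txt" | tr -d ' ')
words=$(wc -w < "$tmp/pets.txt" | tr -d ' ')
bytes=$(wc -c < "$tmp/pets.txt" | tr -d ' ')
echo "lines=$lines words=$words bytes=$bytes"

# Common pipeline use: count how many lines match a pattern
dogs=$(grep dog "$tmp/pets.txt" | wc -l | tr -d ' ')
echo "dogs=$dogs"
```

The byte count is 28 because each of the seven 3-letter words is followed by a newline character.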
mv
Moves files and folders. The first argument is the file to be moved, the second argument is the destination
- -f to force the move and overwrite files without checking with the user
- -i to prompt for confirmation before overwriting files
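A short sketch of renaming, overwriting, and moving with mv follows. All file and directory names are placeholders, and -i is left out because it waits for a keyboard answer.

```shell
tmp=$(mktemp -d)
printf 'draft\n' > "$tmp/report.txt"
printf 'old\n' > "$tmp/final.txt"
mkdir "$tmp/archive"

# Rename: source and destination are in the same directory
mv "$tmp/report.txt" "$tmp/notes.txt"

# -f: overwrite the existing final.txt without asking
mv -f "$tmp/notes.txt" "$tmp/final.txt"

# Move into a directory: the file keeps its name
mv "$tmp/final.txt" "$tmp/archive/"

ls "$tmp" "$tmp/archive"
```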
cp
Copy files and directories. If the destination file exists, it will be overwritten
- -b creates a backup of the destination file in the same folder under a different name
- -f forces copying: if the existing destination file cannot be opened for writing, it is deleted and the copy is retried
- -i interactive copying with a warning before overwriting
- -r or -R recursive copying for directories
- -p preserves file characteristics (mod time, access time, ownership, permission)
- '*' uses the * wildcard to represent all files and directories matching a pattern
# ___ Copy content of file1 into a new/old file file2
cp file1 file2
# ___ Copy multiple files to a directory (the directory must already exist)
cp file1 file2 file3 xx/blah/
# ___ Copy an entire directory to another directory
cp -R directory1 directory2
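The copy patterns above can be tried safely in a scratch directory. In this sketch every path (src, backup, src_clone) is hypothetical. When several source files are given, cp expects the last argument to be a directory that already exists, so the example creates it first with mkdir -p.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/src/sub" "$tmp/backup"
printf 'one\n' > "$tmp/src/a.txt"
printf 'two\n' > "$tmp/src/sub/b.txt"

# Copy one file to a new name
cp "$tmp/src/a.txt" "$tmp/src/a_copy.txt"

# Copy several files into a directory that already exists
cp "$tmp/src/a.txt" "$tmp/src/a_copy.txt" "$tmp/backup/"

# -R copies a directory tree; -p keeps timestamps and permissions
cp -R -p "$tmp/src" "$tmp/src_clone"

ls -R "$tmp"
```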
mkdir
Used to create a directory
- --help displays help-related information
- --version displays the version number and other information about mkdir
- -v enables verbose mode, displaying a message for every directory created
- -p creates parent directories if necessary. If the specified directories already exist, no error is reported
- -m sets the file modes or permissions for the created directories
# ___ Create directory third inside second, which is itself inside first
mkdir -p first/second/third
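A small sketch of -p and -m in action, using made-up directory names inside a temporary directory:

```shell
tmp=$(mktemp -d)

# -p creates first/ and first/second/ on the way to third/
mkdir -p "$tmp/first/second/third"

# Running it again is not an error, because -p tolerates existing directories
mkdir -p "$tmp/first/second/third"

# -m sets the permissions at creation time (700 = owner only)
mkdir -m 700 "$tmp/private"

# The first 10 characters of ls -ld show the type and permission bits
perms=$(ls -ld "$tmp/private" | cut -c 1-10)
echo "$perms"
```

Without -p, the first command would fail because first/second does not exist yet.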
File Directory
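list - ls
ls lists all the files in a directory.
- Common options: -a -l
Some tools also provide their own list subcommand. For example, the Airflow command line offers airflow dags list, which lists the DAGs Airflow knows about; it is an Airflow command rather than a form of ls on the airflow/dags directory.
airflow dags list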
Command | Description |
---|---|
ls | List all the files in a directory |
ls -l | List all files and their details (owner, mtime, size, etc) |
ls -a | List all the files in a directory (including hidden files) |
pwd | Show the present working directory |
cd | Change directory to some other location |
file | View the type of any file |
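The difference between ls and ls -a is easiest to see with a hidden file present. The sketch below uses invented names (visible.txt, .hidden, subdir) in a temporary directory.

```shell
tmp=$(mktemp -d)
touch "$tmp/visible.txt" "$tmp/.hidden"
mkdir "$tmp/subdir"

# Plain ls skips entries whose names start with a dot
plain=$(ls "$tmp")
# -a includes hidden entries (and the . and .. directory links)
all=$(ls -a "$tmp")
echo "$plain"
echo "$all"

# -l adds one detail line per entry: permissions, owner, size, modification time
ls -l "$tmp"

# cd changes the working directory and pwd prints it
cd "$tmp/subdir" && pwd
```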
Search for Files
Command | Description |
---|---|
locate | Quickly find a file or directory that has been cached |
find | Search for a file or directory based on name and other parameters |
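A minimal sketch of find, searching a made-up project tree by name and by type. locate is not shown because it depends on a prebuilt index that may not exist on a given machine.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/project/logs"
touch "$tmp/project/notes.txt" "$tmp/project/logs/app.log" "$tmp/project/logs/db.log"

# Search by name pattern (quote the pattern so the shell does not expand it)
find "$tmp/project" -name '*.log'
log_count=$(find "$tmp/project" -name '*.log' | wc -l | tr -d ' ')

# Search by type: d for directories, f for regular files
dir_count=$(find "$tmp/project" -type d | wc -l | tr -d ' ')
file_count=$(find "$tmp/project" -type f | wc -l | tr -d ' ')
echo "logs=$log_count dirs=$dir_count files=$file_count"
```

find walks the directory tree every time it runs, which is slower than locate but always reflects the current state of the disk.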
Summary
Command | Description |
---|---|
mkdir | Create a new directory |
touch | Create a new, empty file, or update the modified time of an existing one |
cat > file | Create a new file with the text you type after |
cat file | View the contents of a file |
grep | View the lines of a file that match a pattern |
nano file | Open a file (or create new one) in nano text editor |
vim file | Open a file (or create new one) in vim text editor |
rm or rmdir | Remove a file or empty directory |
rm -r | Remove a directory that isn’t empty |
mv | Move or rename a file or directory |
cp | Copy a file or directory |
rsync | Synchronize the changes of one directory to another |
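The removal commands in the table differ in what they are willing to delete. The sketch below, with placeholder names in a temporary directory, shows rm, rmdir, and rm -r side by side.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/empty_dir" "$tmp/full_dir"
touch "$tmp/full_dir/data.txt" "$tmp/scratch.txt"

# rm removes a file
rm "$tmp/scratch.txt"

# rmdir removes a directory only when it is empty
rmdir "$tmp/empty_dir"

# On a non-empty directory rmdir fails and leaves it in place
if ! rmdir "$tmp/full_dir" 2>/dev/null; then
  echo "rmdir refused: full_dir is not empty"
fi

# rm -r removes the directory together with its contents
rm -r "$tmp/full_dir"
ls -A "$tmp"
```

Because rm -r deletes everything below the given path without a recycle bin, it is worth double-checking the path before running it.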