File Manipulation

Handle Files


wget

The wget command downloads a file, for example a text file called usdoi.txt, from the provided URL. You’ll see this command again later in the context of networking commands. After the download finishes, you can confirm that the file arrived by listing the directory with the ls command.

wget is similar to curl, but it is geared toward retrieving files located at a URL, or the HTML code of a webpage.

  • Unlike curl, it supports recursive downloads, which helps when a webpage links to several files
  • It reports progress while the download is running
$ wget https://www.yourdataiq.com/TT/Data/Blah/dingo.txt

-P

-P sets the directory in which the downloaded file is saved

# We are using Linux in WSL, so paths on the Windows file system start with /mnt
# Change to directory
# Create new directory within it
~$ cd /mnt/d/data/Linux_projects/Final_projects
~$ ls
C6_M4
~$ mkdir C7
~$ ls
C6_M4  C7
# Download the script to the new directory

$ wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0231EN-SkillsNetwork/datasets/World/world_mysql_script.sql -P C7

# OUTPUT
--2024-09-30 16:52:31--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0231EN-SkillsNetwork/datasets/World/world_mysql_script.sql
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 389702 (381K) [application/x-sql]
Saving to: ‘C7/world_mysql_script.sql’

world_mysql_script.sql        100%[=================================================>] 380.57K  1.26MB/s    in 0.3s

2024-09-30 16:52:33 (1.26 MB/s) - ‘C7/world_mysql_script.sql’ saved [389702/389702]

curl

curl is a command-line utility for transferring data from or to a server, designed to work without user interaction. With curl, you can download or upload data using one of the supported protocols, including HTTP, HTTPS, SCP, SFTP, and FTP.

  • curl provides options to resume transfers, limit bandwidth, use a proxy, authenticate users, and much more.
  • Syntax: curl [options] [URL...]
  • If you want to extract the source code for a web page just use curl blah.com and it will print the source code
  • If no protocol is specified, curl guesses one and defaults to HTTP
  • -o lower case - saves the output to a filename: curl -o filename https://blah.com/santa/whogoes/there.js
  • -O Upper case - saves the file with its original filename
  • Multiple files can be downloaded by using the -O Upper case paired with each origin location
  • -C resumes a download that was interrupted; it takes an offset, and -C - lets curl work out the offset itself
  • -I to fetch the HTTP headers only
  • -L follows the redirects
  • -A to change the User-Agent
  • --limit-rate limits the data transfer rate, given in bytes by default or with a suffix for kilobytes (k), megabytes (m), or gigabytes (g)
  • -u to access a protected FTP server with username and password
  • -T is used to upload a file to the FTP server
curl -O http://~/archlinux-2018.06.01-x86_64.iso  \
     -O https://~/debian-9.4.0-amd64-netinst.iso

# ___  If your connection drops you can resume the download with -C - (curl works out the offset)
curl -C - -O https://santa/blah.tz

# ___  Fetch the HTTP headers only
curl -I --http2 https://www.santa.claus.com/
        
# ___  Access a protected FTP with Us & Pw
curl -u FTP_USERNAME:FTP_PASSWORD ftp://ftp.santa.claus.com/
# ___  You can download a single file with
curl -u FTP_USERNAME:FTP_PASSWORD ftp://ftp.santa.claus.com/whereismyfile.tar.gz

# ___  Upload file to FTP server
curl -T thisfile.tar.gz  -u FTP_USERNAME:FTP_PASSWORD ftp://ftp.santa.claus.com/
        
# ___ Print the source of the entire landing page
curl www.google.com

# ___ Extract and save the content into a new file
curl www.google.com -o file1.txt
head -n 1 file1.txt   # to view the head

cut

cut is used to extract one or more columns of data. It can be used with csv or tabular text data files

  • -f selects by column or field; -d sets the field delimiter (for example, -d, uses , as the delimiter)
  • cut -f 1,3,5 -d, filename (extracts columns 1, 3, and 5 of a comma-delimited file)
  • cut -f 5-25,32,69-96 -d, filename (extracts ranges and single columns)
  • cut -d'|' -f 5 filename (extracts from a pipe-delimited file: it cuts each line at the | and returns the 5th field)
  • cut -b -12 filename (-b cuts by byte position instead of field; extracts the first 12 bytes)
  • cut -c 2-7 filename (-c cuts by character; extracts characters 2 through 7 of each line)
  • cut -d <delimiter1> -f 2 filename | cut -d <delimiter2> -f 1 (extracts the text between the two delimiters)
# --- Let's say a web server log looks like this:
127.0.0.1 - john [10/Oct/2019:13:55:36 -0700] "GET /home.html HTTP/1.0" 200 2326

# ___  Extract the request timestamps |> [timestamp]
cut -d'[' -f2 filename | cut -d']' -f1
# ___  Extract column 5-7 with tab delimiter
cut -f5-7 -d$'\t' filename --output-delimiter=","
# ___  Can do this instead: since tab is the default for -d we can omit it
cut -f5-7 filename --output-delimiter=","
# ___  Convert tab delimiter to something else then cut it and save it elsewhere
cut -f5-7 filename |tr '\t' ',' > destinationfilename

# ___ To extract the last name from a "first last" line, where the delimiter is a space
cut -d ' ' -f2 file1.txt
# It splits each line at the space and extracts the 2nd field
Code                                   Description
-f                                     cut by column or field
-d <delimiter>                         set the field delimiter
-b                                     cut by byte
-c                                     cut by character
--complement                           cut everything except the selection
cut -d <d1> -f 2 | cut -d <d2> -f 1    cut the text between the two delimiters

cut --complement

You can invert the selection with cut --complement: instead of extracting the listed columns, it excludes them and takes everything else.

  • cut --complement -f 2,4 filename (takes all fields except 2 and 4)
  • cut --complement -b -12 filename (takes all bytes except the first 12)
  • The same effect is possible without --complement by listing what to keep. If we know the maximum line length is 32, this keeps three blocks of 10 characters on each line and skips characters 11 and 22:
cut -c 1-10,12-21,23-32 data.txt
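
The field and complement selections above can be tried on a small file. This is only a sketch: the file people.csv and its contents are made up for illustration, and --complement assumes GNU cut.

```shell
# Hypothetical sample data: a small comma-delimited file in a temp directory.
workdir=$(mktemp -d)
printf 'id,name,city,age\n1,Ann,Oslo,31\n2,Bob,Lima,45\n' > "$workdir/people.csv"

# Keep only fields 2 and 4 (name and age).
kept=$(cut -d, -f2,4 "$workdir/people.csv")
echo "$kept"

# Invert the selection: drop fields 2 and 4, keep the rest (GNU cut only).
dropped=$(cut -d, --complement -f2,4 "$workdir/people.csv")
echo "$dropped"

# Cut by character position: the first three characters of each line.
chars=$(cut -c1-3 "$workdir/people.csv")
echo "$chars"
```

The first command prints the name and age columns, the second prints the id and city columns, which shows that --complement selects exactly the fields the plain -f list leaves out.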

grep

grep (global regular expression print) searches text for a pattern and returns the lines in a file that match it

  • If you want to search for pattern “ch” use grep ch filename
  • If you want to include a mixture of upper and lower case matches use -i to make it case insensitive
# ___  Here we pipe the output of airflow dags list into grep to search it for a string
airflow dags list|grep "Bash_ETL_Server_Log_Processing"

# To find lines that contain the pattern "ch"
grep ch file1.txt

# To find lines that contain case insensitive "ch"
grep -i ch file1.txt

# To print lines which DO NOT contain a pattern use -v
grep -v login /etc/passwd   # prints all lines not containing login

Some frequently used options for grep include:

Option   Description
-n       Along with the matching lines, also print the line numbers
-c       Get the count of matching lines
-i       Ignore the case of the text while matching
-v       Print all lines which do not contain the pattern
-w       Match only if the pattern matches whole words
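
A short sketch of how these options behave on the same input. The file notes.txt and its four lines are invented for the example.

```shell
# Hypothetical sample file with four lines.
workdir=$(mktemp -d)
printf 'Chess club\nlunch at noon\nch\nteam meeting\n' > "$workdir/notes.txt"

# -n: prefix each matching line with its line number.
numbered=$(grep -n ch "$workdir/notes.txt")
echo "$numbered"

# -c: print only the number of matching lines (case sensitive).
count=$(grep -c ch "$workdir/notes.txt")

# -i with -c: the count now also includes "Chess".
count_i=$(grep -ci ch "$workdir/notes.txt")

# -w: "ch" must be a whole word, so "lunch" no longer matches.
whole=$(grep -w ch "$workdir/notes.txt")

# -v: lines that do NOT contain the pattern.
without=$(grep -v ch "$workdir/notes.txt")

echo "count=$count count_i=$count_i"
echo "$whole"
echo "$without"
```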

touch

touch is used to create a new file, and can also be used for changing the timestamps on files and directories. You can create as many files as you want with a single command

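  • Common options: -a (change only the access time), -m (change only the modification time), -r (copy the timestamps of a reference file), -d (use the given date instead of the current time)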
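
A minimal sketch of touch in use. The file names are placeholders, and the date string passed to -d assumes GNU touch (other implementations expect a different date format).

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Create several empty files with a single command.
touch a.txt b.txt c.txt

# -d: set an explicit timestamp (GNU date-string syntax, an assumption here).
touch -d '2024-01-01 12:00:00' a.txt

# -r: copy the timestamps of a.txt onto b.txt.
touch -r a.txt b.txt

# -m: update only the modification time of c.txt to now.
touch -m c.txt

# Show the resulting modification times.
ls -l a.txt b.txt c.txt
```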

cat

cat can display, create, or combine files:

  • create a file (cat > file)
  • read a file, printing its entire content
  • concatenate files: it attaches one file to the end of another. In other words, it merges files vertically (similar to rbind in R)
  • Common options:
    • -b number non-blank output lines
    • -n number all output lines
    • -s squeeze multiple adjacent blank lines
    • -v display nonprinting characters, except for tabs and the end of line chars
# ___ Print entire content of file
$ cat file1.txt

# ___  Concatenate two files
$ cat file1.txt file2.txt
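
The create, concatenate, and numbering uses of cat can be combined in one short sketch. The file names and contents are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Create a file: cat copies the here-document into file1.txt.
cat > file1.txt <<'EOF'
alpha


beta
EOF
printf 'gamma\n' > file2.txt

# Concatenate: file2.txt is appended after file1.txt in the new file.
cat file1.txt file2.txt > combined.txt
total_lines=$(wc -l < combined.txt)

# -s squeezes the two blank lines into one; -n numbers every output line.
numbered=$(cat -s -n combined.txt)
echo "$numbered"
```

Here combined.txt has five lines, while the squeezed and numbered output has four, because the two adjacent blank lines collapse into one.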

more

  • prints the contents of a file one page at a time, whereas cat prints the entire content at once, leaving you to scroll back through whatever your terminal retains.
  • Hitting the space bar will scroll to the next page
$ more file1.txt

less

Just like more, the less command displays the first page of the file. What’s useful about less is that you can use it to move around the file, page by page, using the Page Up and Page Down keys.

You can also scroll up and down through the file line-by-line, using the Up Arrow and Down Arrow keys, ↑ and ↓.

Unlike more, less does not automatically exit when you reach the end of a file, allowing you the option to continue scrolling around. You can quit at any time by typing q.

sort

  • To sort the lines of a file in ascending alphabetical order use: sort file1.txt
  • To sort in descending order include: -r like this: sort -r file1.txt
  • suppose you have the file pets.txt
$ cat pets.txt
goldfish
dog
cat
parrot
dog
goldfish
goldfish

$ sort pets.txt
cat
dog
dog
goldfish
goldfish
goldfish
parrot

uniq

  • To filter out repeated lines and display the remaining ones use: uniq file1.txt
  • uniq only drops lines that are identical and consecutive. This is similar to what is known as “dropping duplicates”, but duplicated lines can still be left over if they do not appear right after one another, so the result is not necessarily a list of unique lines
  • To get truly unique lines, sort the file first so that duplicates become adjacent
$ uniq pets.txt
goldfish
dog
cat
parrot
dog
goldfish

# Combine sort & uniq
$ sort pets.txt | uniq
cat
dog
goldfish
parrot

paste (default)

Merges several input files to produce a new delimited text file. The default mode is parallel with a TAB delimiter, so each input file's data becomes a column in the output, similar to cbind in R. It is usually most useful to set the delimiter explicitly.

  • -d sets a delimiter other than the default TAB

paste (serial)

Serial mode differs from the default as follows. Say we have two files, each with one column and 5 lines of data. Default (parallel) mode pastes them side by side into a 2-column by 5-row output (cbind). Serial mode joins the 5 lines of file1 into a single output line with 5 columns, then joins the 5 lines of file2 into a second output line below it. So now we have a 5-column by 2-row output.

  • -s serial mode

NOTE: In effect, paste -s joins all the lines of each input file into one line: the lines of the first file become the first line of the output, the lines of the second file become the second line of the output, and so on.

# ___  Paste files together and leave the delimiter as default (tab)
paste file1.txt file2.txt file3.txt
# OUTPUT
Alan    Blah    2345
Bob     Dingo   9083
Stinky  Wingo   8374

# ___  Paste three files and save result to new file, set delimiter to ,
paste -d "," csv_data.csv tsv_data.csv fixed_width_data.csv > extracted_data.csv

# ___  Paste two files and look at the first 3 lines, set delimiter to ;
paste -d ';' file1.csv file2.csv | head -3
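
To make the parallel versus serial difference concrete, here is a small sketch with two five-line files. The file contents (a1..a5 and b1..b5) are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'a1\na2\na3\na4\na5\n' > file1.txt
printf 'b1\nb2\nb3\nb4\nb5\n' > file2.txt

# Parallel (default): each file becomes a column, giving 5 rows of 2 fields.
parallel=$(paste -d, file1.txt file2.txt)
echo "$parallel"

# Serial (-s): each file becomes a row, giving 2 rows of 5 fields.
serial=$(paste -s -d, file1.txt file2.txt)
echo "$serial"
```

The parallel output starts with a1,b1 and runs for five lines; the serial output is two lines, a1,a2,a3,a4,a5 followed by b1,b2,b3,b4,b5.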

tail

Similar to head, except that it displays the last 10 lines of a file by default

  • -n N prints out the last N lines of the file(s)
head -3 file1.txt  # for first 3 lines
tail -3 file1.txt   # for last 3 lines
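
head and tail can also be chained to pull out a block from the middle of a file. A minimal sketch, assuming a made-up file numbers.txt that holds the numbers 1 to 20, one per line:

```shell
workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"

# First and last three lines.
first3=$(head -n 3 "$workdir/numbers.txt")
last3=$(tail -n 3 "$workdir/numbers.txt")

# With no option, tail prints the last 10 lines.
default_count=$(tail "$workdir/numbers.txt" | wc -l)

# Lines 11 to 15: take the first 15 lines, then the last 5 of those.
middle=$(head -n 15 "$workdir/numbers.txt" | tail -n 5)

echo "$first3"
echo "$last3"
echo "$middle"
```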

wc

word count: counts the lines, words, and characters in your file

  • to view only the line count use: -l
  • to view only the word count use: -w
  • to view only the character count use: -c (strictly a byte count; -m counts characters in multibyte text)
$ cat pets.txt
# OUTPUT
cat
cat
cat
cat
dog
dog
cat

$ wc pets.txt
# OUTPUT
7 7 28 pets.txt
# this means your file contains:
# 7 lines, 7 words, 28 characters (newline characters are counted)

# lines only
$ wc -l pets.txt
# words only
$ wc -w pets.txt
# chars only
$ wc -c pets.txt
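
When a count is needed inside a script, one common approach is to feed the file to wc on standard input so that the file name is not printed. A sketch using the same seven-line pets.txt as above, recreated here in a temp directory:

```shell
workdir=$(mktemp -d)
printf 'cat\ncat\ncat\ncat\ndog\ndog\ncat\n' > "$workdir/pets.txt"

# Reading from standard input makes wc print only the number.
lines=$(wc -l < "$workdir/pets.txt")
words=$(wc -w < "$workdir/pets.txt")
bytes=$(wc -c < "$workdir/pets.txt")

# Each line is 3 letters plus a newline, so 7 lines give 28 bytes.
echo "lines=$lines words=$words bytes=$bytes"
```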

mv

Moves or renames files and folders. The first argument is the file to be moved, the second is the destination

  • -f to force move and overwrite files without checking with user
  • -i to prompt confirmation before overwriting files
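
A minimal sketch of the two uses of mv, renaming and moving. The file and directory names are placeholders; -i is left out of the run because it waits for keyboard input.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'draft\n' > notes.txt
mkdir archive

# Rename: source and destination are in the same directory.
mv notes.txt report.txt

# Move: the destination is an existing directory, so the file keeps its name.
mv report.txt archive/

# -f: overwrite the existing archive/report.txt without asking.
printf 'new\n' > report.txt
mv -f report.txt archive/report.txt

cat archive/report.txt
```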

cp

Copies files and directories. If the destination file exists, it is overwritten

  • -b backs up an existing destination file under a different name in the same folder before overwriting it
  • -f forces the copy: if an existing destination file cannot be opened for writing, it is removed and the copy is retried
  • -i interactive copying with a warning before overwriting
  • -r or -R recursive copying for directories
  • -p preserves file characteristics (mod time, access time, ownership, permission)
  • * is a wildcard that stands for all files and directories matching a pattern
# ___  Copy the content of file1 into file2 (created if new, overwritten if it exists)
cp file1 file2

# ___  Copy multiple files to a directory; the directory must already exist, cp does not create it
cp file1 file2 file3 xx/blah/
        
# ___  Copy an entire directory to another directory
cp -R directory1 directory2
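
Because cp does not create a missing target directory, a typical sequence pairs it with mkdir -p. A short sketch with made-up file names:

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'one\n' > file1
printf 'two\n' > file2

# Create the target directory first; cp fails if it is missing.
mkdir -p xx/blah

# Copy several files into the directory.
cp file1 file2 xx/blah/

# Copy a whole directory tree; backup does not exist yet, so it becomes a copy of xx.
cp -R xx backup

# -p keeps the modification time and permissions of the original.
cp -p file1 file1.copy

ls -R backup
```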

mkdir

Used to create a directory

  • --help displays help related information
  • --version displays the version number and other information about mkdir
  • -v enables verbose mode, displaying a message for every directory created
  • -p creates parent directories as needed. If the specified directories already exist, no error is reported
  • -m sets the file mode (permissions) for the created directories
# ___  Create directory third inside second inside first, creating any that are missing
mkdir -p first/second/third
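
The -p, -v, and -m options can be seen together in a small sketch. The directory names are placeholders, and the permission string is read back with ls -ld.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# -p creates the missing parents; -v reports each directory it creates.
mkdir -p -v first/second/third

# Running it again is not an error, because -p tolerates existing directories.
mkdir -p first/second/third

# -m sets the mode at creation time: 700 means owner-only access.
mkdir -m 700 private

# The first ten characters of ls -ld show the type and permission bits.
perms=$(ls -ld private | cut -c1-10)
echo "$perms"
```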

File Directory


list - ls

ls lists all the files in a directory. Some tools provide their own list subcommand instead; for example, the Airflow CLI uses airflow dags list to list its DAGs.

  • Common options: -a -l

Command   Description
ls        List all the files in a directory
ls -l     List all files and their details (owner, mtime, size, etc)
ls -a     List all the files in a directory (including hidden files)
pwd       Show the present working directory
cd        Change directory to some other location
file      View the type of any file
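
A short sketch of moving around and listing a directory. The directory contents (one visible file, one hidden file, one subdirectory) are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch visible.txt .hidden
mkdir sub

# Where are we?
pwd

# Plain ls skips names that start with a dot.
plain=$(ls)
echo "$plain"

# -a includes hidden entries, as well as . and ..
all=$(ls -a)
echo "$all"

# -l adds owner, size, and modification time for each entry.
ls -l

# Change into the subdirectory and confirm the new location.
cd sub
pwd
```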

Search for Files


Command   Description
locate    Quickly find a file or directory that has been cached
find      Search for a file or directory based on name and other parameters
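
A minimal sketch of find searching by name and by type. The directory tree is made up for illustration; locate is left out because it depends on a prebuilt index that may not exist on a given machine.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project/docs"
touch "$workdir/project/readme.txt" \
      "$workdir/project/docs/guide.txt" \
      "$workdir/project/docs/logo.png"

# Regular files whose name ends in .txt, anywhere below project.
txt_files=$(find "$workdir/project" -type f -name '*.txt' | sort)
echo "$txt_files"

# Directories only (the starting directory itself is included).
dirs=$(find "$workdir/project" -type d | sort)
echo "$dirs"
```

Quoting the pattern ('*.txt') keeps the shell from expanding it before find sees it.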

Summary


Command       Description
mkdir         Create a new directory
touch         Create a new, empty file, or update the modified time of an existing one
cat > file    Create a new file with the text you type after
cat file      View the contents of a file
grep          View the lines of a file that match a pattern
nano file     Open a file (or create new one) in nano text editor
vim file      Open a file (or create new one) in vim text editor
rm or rmdir   Remove a file or empty directory
rm -r         Remove a directory that isn’t empty
mv            Move or rename a file or directory
cp            Copy a file or directory
rsync         Synchronize the changes of one directory to another