File Manipulation

Handle Files

wget

The wget command retrieves a file located at a URL, or the HTML code of a webpage. You'll see this command again later in the context of networking commands. You can check that a download succeeded by listing the file with the ls command.

It is similar to curl, with two practical differences:

It supports recursive file downloads, which helps when a webpage contains several files to download
It prints progress information while the download is running

$ wget https://www.yourdataiq.com/TT/Data/Blah/dingo.txt

-P

-P sets the directory in which the downloaded file is saved

# We are using Linux in WSL, so we have to add /mnt at the start of the path to access files on the Windows file system
# Change to the directory
# Create a new directory within it
~$ cd /mnt/d/data/Linux_projects/Final_projects
~$ ls
C6_M4
~$ mkdir C7
~$ ls
C6_M4  C7
# Download the script to the new directory

$ wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0231EN-SkillsNetwork/datasets/World/world_mysql_script.sql -P C7

# OUTPUT
--2024-09-30 16:52:31--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0231EN-SkillsNetwork/datasets/World/world_mysql_script.sql
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 389702 (381K) [application/x-sql]
Saving to: ‘C7/world_mysql_script.sql’

world_mysql_script.sql        100%[=================================================>] 380.57K  1.26MB/s    in 0.3s

2024-09-30 16:52:33 (1.26 MB/s) - ‘C7/world_mysql_script.sql’ saved [389702/389702]

curl

curl is a command-line utility for transferring data to or from a server, designed to work without user interaction. With curl, you can download or upload data using one of the supported protocols, including HTTP, HTTPS, SCP, SFTP, and FTP.

curl provides a number of options that let you resume transfers, limit the bandwidth, use a proxy, authenticate a user, and much more.
Syntax: curl [options] [URL...]
If you want to extract the source code for a web page just use curl blah.com and it will print the source code
If no protocol is specified, curl guesses one and defaults to HTTP
-o lower case - saves the output to a filename: curl -o filename https://blah.com/santa/whogoes/there.js
-O Upper case - saves the file with its original filename
Multiple files can be downloaded by pairing one uppercase -O with each URL
-C resumes an interrupted download from a given byte offset; use -C - to let curl work out the offset itself
-I to fetch the HTTP headers only
-L follows the redirects
-A to change the User-Agent
--limit-rate limits the data transfer rate, given in bytes per second or with a suffix for kilobytes (k), megabytes (m), or gigabytes (g)
-u to access a protected FTP server with username and password
-T is used to upload a file to the FTP server

curl -O http://~/archlinux-2018.06.01-x86_64.iso  \
     -O https://~/debian-9.4.0-amd64-netinst.iso

# ___  If your connection drops you can resume the download with -C
# ___  -C needs an offset; "-" tells curl to work it out from the partial file
curl -C - -O https://santa/blah.tz

# ___  Fetch the HTTP headers only
curl -I --http2 https://www.santa.claus.com/
        
# ___  Access a protected FTP with Us & Pw
curl -u FTP_USERNAME:FTP_PASSWORD ftp://ftp.santa.claus.com/
# ___  You can download a single file with
curl -u FTP_USERNAME:FTP_PASSWORD ftp://ftp.santa.claus.com/whereismyfile.tar.gz

# ___  Upload file to FTP server
curl -T thisfile.tar.gz  -u FTP_USERNAME:FTP_PASSWORD ftp://ftp.santa.claus.com/
        
# ___ Print the entire landing page
curl www.google.com

# ___ Extract and save the content into a new file
curl www.google.com -o file1.txt
head -n 1 file1.txt   # to view the head

cut

cut is used to extract one or more columns of data. It can be used with csv or tabular text data files

cut -f selects by column or field; -d, sets , as the field delimiter
cut -f 1,3,5 -d, filename (extracts columns 1, 3, and 5, separated by ,)
cut -f 5-25,32,69-96 -d, filename (extracts ranges and single columns; the list must not contain spaces)
cut -d'|' -f 5 filename (extracts from a pipe-delimited file: it cuts each line at the | and returns the 5th field)
cut -b -12 filename (-b cuts by byte position instead of field; extracts the first 12 bytes)
cut -c 2-7 filename (-c cuts by character; extracts characters 2 through 7 of each line)
cut -d'<' -f2 filename | cut -d'>' -f1 (extracts the text between two delimiters, here < and >)

# --- Let's say a web server log looks like this:
127.0.0.1 - john [10/Oct/2019:13:55:36 -0700] "GET /home.html HTTP/1.0" 200 2326

# ___  Extract the request timestamps |> [timestamp]
cut -d'[' -f2 filename | cut -d']' -f1

# ___  Extract columns 5-7 with tab delimiter (--output-delimiter is a GNU cut option)
cut -f5-7 -d$'\t' filename --output-delimiter=","
# ___  Can do this instead: since tab is the default for -d we can omit it
cut -f5-7 filename --output-delimiter=","

# ___  Convert tab delimiter to something else then cut it and save it elsewhere
cut -f5-7 filename |tr '\t' ',' > destinationfilename

# ___ To extract the last name from a "first last" line separated by " " (the delimiter)
cut -d ' ' -f2 file1.txt
# It will separate the line at the space and extract the 2nd field

Code	Description
-f	cut by column or field
-d <delimiter>	set the field delimiter
-b	cut by byte
-c	cut by character
--complement	cut everything except the selection
-d <delimiter1> filename | cut -d <delimiter2>	cut the text between the two delimiters
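To see the field, delimiter, and character options working together, here is a minimal sketch that runs cut against the sample web server log line from this section. The file name access.log and the temporary directory are placeholders chosen for illustration.

```shell
# Work in a throwaway directory so nothing outside it is touched
workdir=$(mktemp -d)
cat > "$workdir/access.log" <<'EOF'
127.0.0.1 - john [10/Oct/2019:13:55:36 -0700] "GET /home.html HTTP/1.0" 200 2326
EOF

# Field 1 with a space delimiter is the client IP address
ip=$(cut -d' ' -f1 "$workdir/access.log")

# Cut at '[' and keep what follows, then cut at ']' and keep what precedes
timestamp=$(cut -d'[' -f2 "$workdir/access.log" | cut -d']' -f1)

# Characters 1-9 of the line, selected by position rather than by field
first_chars=$(cut -c 1-9 "$workdir/access.log")

echo "$ip"
echo "$timestamp"
echo "$first_chars"
```

Chaining two cut calls is a simple way to grab text between two different delimiters, since a single cut accepts only one delimiter character.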

cut --complement

You can invert the selection with cut --complement: instead of extracting the selected columns, cut excludes them and returns everything else.

cut --complement -f 2,4 filename (takes all fields except 2 and 4)
cut --complement -b -12 filename (takes all bytes except the first 12)
If we know each line is at most 32 characters long, we can extract three 10-character blocks (characters 1-10, 12-21, and 23-32), skipping the characters at positions 11 and 22:

cut -c 1-10,12-21,23-32 data.txt
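The following sketch contrasts a normal selection with its complement on a small CSV file. It assumes GNU cut, since --complement is not part of POSIX; the file people.csv and its contents are made up for the example.

```shell
workdir=$(mktemp -d)
printf 'id,name,city,age\n1,Alan,Oslo,30\n2,Bob,Rome,41\n' > "$workdir/people.csv"

# Keep only fields 2 and 4
kept=$(cut -d, -f 2,4 "$workdir/people.csv")

# Invert the selection: everything except fields 2 and 4 (GNU cut only)
dropped=$(cut --complement -d, -f 2,4 "$workdir/people.csv")

echo "$kept"
echo "$dropped"
```

The two results together contain every column of the input exactly once, which makes --complement convenient for dropping a few columns from a wide file.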

grep

grep (global regular expression print) searches text for a pattern and returns the lines that match it

If you want to search for pattern “ch” use grep ch filename
If you want to include a mixture of upper and lower case matches use -i to make it case insensitive

# ___  Here we filter the output of airflow dags list for a string by piping it to grep
airflow dags list|grep "Bash_ETL_Server_Log_Processing"

# To find lines that contain the pattern "ch"
grep ch file1.txt

# To find lines that contain case insensitive "ch"
grep -i ch file1.txt

# To print lines which DO NOT contain a pattern use -v
grep -v login /etc/passwd   # prints all lines not containing login

Some frequently used options for grep include:

Option	Description
`-n`	Along with the matching lines, also print the line numbers
`-c`	Get the count of matching lines
`-i`	Ignore the case of the text while matching
`-v`	Print all lines which do not contain the pattern
`-w`	Match only if the pattern matches whole words
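A short sketch of how the options in the table change the result on the same input. The file words.txt and its five lines are invented for the example.

```shell
workdir=$(mktemp -d)
printf 'Chair\nbench\nchurch\nthe ch sound\ntable\n' > "$workdir/words.txt"

# -c counts matching lines; the match is case-sensitive, so "Chair" is skipped
count_cs=$(grep -c ch "$workdir/words.txt")

# Adding -i ignores case, so "Chair" now matches as well
count_ci=$(grep -ci ch "$workdir/words.txt")

# -w matches "ch" only as a whole word; -n prefixes the line number
whole_word=$(grep -wn ch "$workdir/words.txt")

# -v inverts the match: lines that do NOT contain "ch"
no_match=$(grep -v ch "$workdir/words.txt")

echo "$count_cs $count_ci"
echo "$whole_word"
echo "$no_match"
```

Single-letter options can be combined, as in -ci and -wn above.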

touch

touch is used to create a new file, and can also be used for changing the timestamps on files and directories. You can create as many files as you want with a single command

Common options: -a -m -r -d
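A minimal sketch of creating several files at once and of the -d and -r options. The file names are placeholders, and the date string format is the one accepted by GNU touch, which is an assumption about the system in use.

```shell
workdir=$(mktemp -d)

# Create several empty files with one command
touch "$workdir/a.txt" "$workdir/b.txt" "$workdir/c.txt"

# -d sets an explicit timestamp instead of the current time (GNU date syntax)
touch -d "2020-01-01 00:00:00" "$workdir/ref.txt"

# -r copies the timestamps of a reference file onto another file
touch -r "$workdir/ref.txt" "$workdir/a.txt"

# -nt is the shell's "newer than" test on modification times
if [ "$workdir/b.txt" -nt "$workdir/a.txt" ]; then
  echo "b.txt is newer than a.txt"
fi
```

After the -r call, a.txt carries the 2020 timestamp of ref.txt, so b.txt (created just now) is the newer of the two.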

cat

cat can be used to display, create, or combine files; it prints the entire content of a file

create
read
concatenate files: it attaches one file to the end of another. In other words, it merges files vertically (similar to rbind)
Common options:
- -b number non-blank output lines
- -n number all output lines
- -s squeeze multiple adjacent blank lines
- -v display nonprinting characters, except for tabs and the end of line chars

# ___ Print entire content of file
$ cat file1.txt

# ___  Concatenate two files
$ cat file1.txt file2.txt
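The three uses of cat (create, read, concatenate) can be sketched in one short script. The file names and contents are placeholders for illustration.

```shell
workdir=$(mktemp -d)

# Create: "cat > file" with a here-document writes the given text into a new file
cat > "$workdir/file1.txt" <<'EOF'
alpha
beta
EOF
printf 'gamma\n' > "$workdir/file2.txt"

# Concatenate: both files, in order, are written into a third file
cat "$workdir/file1.txt" "$workdir/file2.txt" > "$workdir/combined.txt"

# Read: -n numbers every output line while printing
cat -n "$workdir/combined.txt"

combined=$(cat "$workdir/combined.txt")
```

Redirecting with > replaces the target file; use >> instead to append to an existing file.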

more

more prints the contents of a file page by page, whereas cat prints the entire content at once, so what you can view is limited by your terminal's size and scrollback settings.
Pressing the space bar scrolls to the next page

$ more file1.txt

less

Just like more, the less command displays the first page of the file. What’s useful about less is that you can use it to move around the file, page by page, using the Page Up and Page Down keys.

You can also scroll up and down through the file line-by-line, using the Up Arrow and Down Arrow keys, ↑ and ↓.

Unlike more, less does not automatically exit when you reach the end of a file, allowing you the option to continue scrolling around. You can quit at any time by typing q.

sort

To sort the lines of a file in ascending alphabetical order, use: sort file1.txt
To sort in descending order, add -r like this: sort -r file1.txt
Suppose you have the file pets.txt

$ cat pets.txt
goldfish
dog
cat
parrot
dog
goldfish
goldfish

$ sort pets.txt
cat
dog
dog
goldfish
goldfish
goldfish
parrot
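Here is a small sketch of ascending, descending, and de-duplicated sorting. It recreates a pets.txt in a temporary directory, so the path is a placeholder; -u is a standard sort option that has the same effect as piping sort into uniq, which is covered next.

```shell
workdir=$(mktemp -d)
printf 'goldfish\ndog\ncat\nparrot\ndog\ngoldfish\ngoldfish\n' > "$workdir/pets.txt"

# Ascending (default) and descending (-r) order
ascending=$(sort "$workdir/pets.txt")
descending=$(sort -r "$workdir/pets.txt")

# -u keeps one copy of each line, like "sort | uniq"
unique=$(sort -u "$workdir/pets.txt")

echo "$descending"
echo "$unique"
```

sort never modifies the input file; redirect the output to a new file if you want to keep the sorted result.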

uniq

To filter out repeated lines, use: uniq file1.txt
uniq drops a line only when it is identical to the line directly before it. This is similar to what is known as "dropping duplicates", but the result is not necessarily a list of unique lines: duplicates that are not adjacent to each other remain. Sort the input first if you want every duplicate removed.

$ uniq pets.txt
goldfish
dog
cat
parrot
dog
goldfish

# Combine sort & uniq
$ sort pets.txt | uniq
cat
dog
goldfish
parrot

paste (default)

paste merges several input files to produce a new delimited text file. The default mode is parallel with a TAB delimiter, which is similar to cbind in R: each input file's data becomes a column in the output. It is often useful to set the delimiter explicitly.

-d sets a delimiter other than the default TAB

paste (serial)

The difference from the default mode is this: say we have two files, each with one column and 5 lines of data. Default (parallel) mode pastes them side by side into a 2-column by 5-row output (cbind). Serial mode takes the 5 lines of file1 and joins them into a single output row of 5 fields, then does the same for file2 on the next row. The result is a 5-column by 2-row output.

-s serial mode

NOTE: paste -s is therefore a handy way to join all the lines of a file into one line: the lines of the first file become the first line of the output, the lines of the second file become the second line, and so on.

# ___  Paste files together and leave the delimiter as default (tab)
paste file1.txt file2.txt file3.txt
# OUTPUT
Alan    Blah    2345
Bob     Dingo   9083
Stinky  Wingo   8374

# ___  Paste three files and save result to new file, set delimiter to ,
paste -d "," csv_data.csv tsv_data.csv fixed_width_data.csv > extracted_data.csv

# ___  Paste two files and look at the first 3 lines, set delimiter to ;
paste -d ';' file1.csv file2.csv | head -3
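The difference between parallel and serial mode is easiest to see on the same two inputs. This sketch uses two made-up three-line files, first.txt and last.txt, and a comma delimiter so the columns are visible.

```shell
workdir=$(mktemp -d)
printf 'Alan\nBob\nStinky\n' > "$workdir/first.txt"
printf 'Blah\nDingo\nWingo\n' > "$workdir/last.txt"

# Parallel (default): one output row per input line, each file becomes a column
parallel=$(paste -d, "$workdir/first.txt" "$workdir/last.txt")

# Serial (-s): one output row per input file, each line becomes a column
serial=$(paste -s -d, "$workdir/first.txt" "$workdir/last.txt")

echo "$parallel"
echo "$serial"
```

With these inputs the parallel result has 3 rows of 2 fields, and the serial result has 2 rows of 3 fields.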

head

The head command displays the first 10 lines of a file. You can also set the number of lines you wish to view by utilizing the -n option

-n N prints out the first N lines of the file(s)
-q suppresses the ==> filename <== headers that head prints when given several files
-v always prints the file name headers, even for a single file

# ___  Print out the first 10 lines
head filename

# ___  Print out the first 7 lines
head -n 7 filename

# ___  Print out the first 5 lines EXCLUDING headers of file1 followed by first 5 lines of file2
head -q -n 5 file1 file2

tail

Similar to head except it displays the last 10 lines of a file

-n N prints out the last N lines of the file(s)

head -n 3 file1.txt  # for first 3 lines
tail -n 3 file1.txt  # for last 3 lines
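head and tail can be chained to pull a range of lines out of the middle of a file. The sketch below assumes a 20-line file generated with seq; the file name is a placeholder.

```shell
workdir=$(mktemp -d)
# seq prints the numbers 1 through 20, one per line
seq 1 20 > "$workdir/numbers.txt"

first_three=$(head -n 3 "$workdir/numbers.txt")
last_three=$(tail -n 3 "$workdir/numbers.txt")

# Lines 8-10: take the first 10 lines, then the last 3 of those
middle=$(head -n 10 "$workdir/numbers.txt" | tail -n 3)

echo "$middle"
```

The same pattern works for any range: head -n END followed by tail -n (END - START + 1).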

wc

word count: counts the lines, words, and bytes in your file

to view only the line count use: -l
to view only the word count use: -w
to view only the byte count use: -c (for plain ASCII text this equals the character count; use -m to count characters)

$ cat pets.txt
# OUTPUT
cat
cat
cat
cat
dog
dog
cat

$ wc pets.txt
# OUTPUT
7 7 28 pets.txt
# this means your file contains:
# 7 lines, 7 words, 28 characters (newline characters are counted)

# lines only
$ wc -l pets.txt
# words only
$ wc -w pets.txt
# bytes only (same as chars for plain ASCII text)
$ wc -c pets.txt
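In scripts you usually want the bare number without the file name. One common approach, sketched here with a recreated pets.txt in a temporary directory, is to feed the file to wc on standard input.

```shell
workdir=$(mktemp -d)
printf 'cat\ncat\ncat\ncat\ndog\ndog\ncat\n' > "$workdir/pets.txt"

# Reading from standard input makes wc print the count without a file name
lines=$(wc -l < "$workdir/pets.txt")
words=$(wc -w < "$workdir/pets.txt")
bytes=$(wc -c < "$workdir/pets.txt")

# Count the lines that match a pattern by piping grep into wc
cats=$(grep cat "$workdir/pets.txt" | wc -l)

echo "$lines lines, $words words, $bytes bytes, $cats cats"
```

Some wc implementations pad the number with leading spaces, so compare the results numerically (for example with -eq) rather than as strings.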

mv

Moves or renames files and folders. The first argument is the file to be moved, the second is the destination

-f to force move and overwrite files without checking with user
-i to prompt confirmation before overwriting files
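Whether mv renames or moves depends on the destination. A minimal sketch, with placeholder file and directory names:

```shell
workdir=$(mktemp -d)
mkdir "$workdir/archive"
printf 'draft\n' > "$workdir/notes.txt"

# Same directory, new name: this is a rename
mv "$workdir/notes.txt" "$workdir/notes_v1.txt"

# Destination is an existing directory: the file moves into it, name unchanged
mv "$workdir/notes_v1.txt" "$workdir/archive/"

ls "$workdir/archive"
```

The trailing slash on the directory is optional, but it makes mv fail loudly if archive does not exist instead of silently renaming the file to "archive".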

cp

Copies files and directories. If the destination file exists, cp overwrites it

-b backs up each existing destination file before overwriting it (GNU cp appends ~ to the backup's name by default)
-f forces copying: if an existing destination file cannot be opened, cp removes it and tries again
-i interactive copying with a warning before overwriting
-r or -R recursive copying for directories
-p preserves file characteristics (mod time, access time, ownership, permission)
* is a shell wildcard that matches all files and directories fitting a pattern, for example cp *.txt backup/

# ___  Copy content of file1 into a new/old file file2
cp file1 file2

# ___  Copy multiple files to a directory; the directory must already exist
cp file1 file2 file3 xx/blah/
        
# ___  Copy an entire directory to another directory
cp -R directory1 directory2
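The sketch below exercises a multi-file copy, a recursive copy, and -p in a throwaway directory. All names are placeholders; note the explicit mkdir before the multi-file copy, since cp expects the target directory to exist.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/src/sub"
printf 'one\n' > "$workdir/src/a.txt"
printf 'two\n' > "$workdir/src/sub/b.txt"

# cp does not create the target directory for a multi-file copy, so make it first
mkdir "$workdir/dest"
cp "$workdir/src/a.txt" "$workdir/src/sub/b.txt" "$workdir/dest/"

# -R copies the directory and everything below it
cp -R "$workdir/src" "$workdir/src_copy"

# -p keeps the modification time and permissions of the original
cp -p "$workdir/src/a.txt" "$workdir/a_preserved.txt"

ls -R "$workdir/src_copy"
```

Because src_copy did not exist beforehand, cp -R creates it as a copy of src; had it existed, src would have been copied inside it instead.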

mkdir

Used to create a directory

--help displays help information
--version displays the version number and other information about mkdir
-v enables verbose mode, displaying a message for every directory created
-p creates parent directories as needed. If the specified directories already exist, no error is reported
-m sets the file mode (permissions) for the created directories

# ___  Create directory third inside second, which is inside first; missing parents are created
mkdir -p first/second/third
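A short sketch of -p and -m in a temporary directory. The directory names are placeholders, and the permission check reads the first ten characters of ls -ld output, which is one simple way to inspect the mode.

```shell
workdir=$(mktemp -d)

# -p creates every missing directory along the path
mkdir -p "$workdir/first/second/third"

# Running it again is not an error, because -p tolerates existing directories
mkdir -p "$workdir/first/second/third"
rerun_status=$?

# -m sets the permissions at creation time (700 = owner only)
mkdir -m 700 "$workdir/private"
perms=$(ls -ld "$workdir/private" | cut -c 1-10)

echo "$perms"
```

Without -p, the second mkdir call would fail because the directory already exists.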

File Directory

list - ls

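ls lists the files in a directory. Some command-line tools also provide their own list subcommand; for example, Airflow lists its DAGs with airflow dags list. This is a subcommand of the airflow CLI, not a path such as airflow/dags.

Common options: -a -l

airflow dags list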

`ls`	List all the files in a directory
`ls -l`	List all files and their details (owner, mtime, size, etc)
`ls -a`	List all the files in a directory (including hidden files)
`pwd`	Show the present working directory
`cd`	Change directory to some other location
`file`	View the type of any file

Search for Files

Command	Description
`locate`	Quickly find a file or directory that has been cached
`find`	Search for a file or directory based on name and other parameters
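As an illustration of searching by name and by type, here is a minimal find sketch run against a small directory tree built for the purpose. The paths are placeholders; locate is left out because it depends on a prebuilt index that may not exist on every system.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/logs/old"
touch "$workdir/logs/app.log" "$workdir/logs/old/app.log" "$workdir/logs/readme.txt"

# -name matches on the file name; quote the pattern so the shell does not expand it
log_files=$(find "$workdir" -type f -name '*.log' | sort)

# -type d restricts the results to directories (here: logs and logs/old)
dir_count=$(find "$workdir/logs" -type d | wc -l)

echo "$log_files"
echo "$dir_count"
```

find walks the directory tree on every call, so it always reflects the current state of the disk, at the cost of being slower than an index lookup on large trees.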

Summary

Command	Description
`mkdir`	Create a new directory
`touch`	Create a new, empty file, or update the modified time of an existing one
`cat > file`	Create a new file with the text you type after
`cat file`	View the contents of a file
`grep`	Print the lines of a file that match a pattern
`nano file`	Open a file (or create new one) in nano text editor
`vim file`	Open a file (or create new one) in vim text editor
`rm` or `rmdir`	Remove a file (`rm`) or an empty directory (`rmdir`)
`rm -r`	Remove a directory that isn’t empty
`mv`	Move or rename a file or directory
`cp`	Copy a file or directory
`rsync`	Synchronize the changes of one directory to another