Tutorials and FAQs: CGI: Unix / Windows text file compatibility
This tutorial aims to explain the use of Carriage Return (CR) and Line Feed (LF) characters in text based files, and in particular what happens when you transfer files containing them between Windows and Unix/Linux (*nix) based systems. By text based files I mean HTML web pages, perl, php and other scripts, and standard .txt files.
CR and LF are traditionally used as line termination characters to identify where one line finishes and the next starts. They are normally invisible when viewing text files because they are control characters, i.e. they are interpreted differently from normal printable text (a-z, A-Z, 0-9, !"£$% etc).
The main problem is that the two systems (Windows and *nix) treat CR and LF differently, and the two methods are not always compatible. Windows terminates each line with a CR followed by an LF, whereas *nix terminates each line with a single LF and treats CR as part of the normal text.
Because of this difference, transferring files between the systems can have unexpected results. Files like php or perl scripts may run perfectly on one system but fail or generate errors on the other, without any obvious difference in the file contents. The same difference (CR/LF or LF only) can also affect text data files transferred between the systems: a script may read the lines of a data file differently depending on the operating system it runs on.
How do the CR/LF differences cause problems
Here is an example that shows the difference between the two systems - the <CR> and <LF> represent the invisible line termination control characters that exist in the file.
First a Windows file called file1.txt:
    This is a line of text<CR><LF>
    This is a line of text<CR><LF>
    <CR><LF>
    This is the final line of text<CR><LF>
Now the same file in Linux called file2.txt:
    This is a line of text<LF>
    This is a line of text<LF>
    <LF>
    This is the final line of text<LF>
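As a rough illustration of the two layouts above, here is a minimal Python sketch that counts how many lines in a block of bytes end in CR+LF and how many end in LF alone. The function name count_line_endings and the variables file1 and file2 are made up for this example; the byte values (CR is 0x0D, LF is 0x0A) are standard ASCII.

```python
CR = b"\r"  # carriage return, byte value 0x0D
LF = b"\n"  # line feed, byte value 0x0A


def count_line_endings(data: bytes) -> dict:
    """Count CR+LF terminated lines and LF-only terminated lines."""
    crlf = data.count(CR + LF)
    # Every CR+LF pair also contains an LF, so subtract those out.
    lf_only = data.count(LF) - crlf
    return {"crlf": crlf, "lf_only": lf_only}


# The contents of file1.txt (Windows) and file2.txt (Linux) from above.
file1 = (b"This is a line of text\r\n"
         b"This is a line of text\r\n"
         b"\r\n"
         b"This is the final line of text\r\n")
file2 = file1.replace(CR + LF, LF)

print("file1.txt:", count_line_endings(file1))
print("file2.txt:", count_line_endings(file2))
```

Working on bytes rather than text avoids any automatic newline translation, so the sketch should behave the same way on Windows and on *nix.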
Now to show what happens when a script reads each line in turn from the two files detailed above, after the Windows file has been transferred to a Linux system 'as is', i.e. still containing CR and LF. Note: normally when a line of text is read, the line terminating characters are stripped off (by the language's I/O layer or by the script itself, for example with perl's chomp) so that the script only works with the actual line text.
First reading each line of file2.txt (created on Linux)
    line 1 = 'This is a line of text'
    line 2 = 'This is a line of text'
    line 3 = ''
    line 4 = 'This is the final line of text'
Now reading each line from file1.txt (created on Windows)
    line 1 = 'This is a line of text<CR>'
    line 2 = 'This is a line of text<CR>'
    line 3 = '<CR>'
    line 4 = 'This is the final line of text<CR>'
You will notice the <CR> is actually returned to the script. This is because on Linux only the LF is treated as the line terminating character and stripped off, so all other text, including the CR, is returned. If the script was not expecting the <CR> in the returned text (because it would have been removed had the script been run on Windows), it may fail. For example, a script that compares each line with some other value would find a match on Windows (because the CR is not there) but not on Linux (because the CR is there).
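The effect described above can be reproduced with a short Python sketch. It is only an illustration under the assumption that the reader treats LF alone as the line terminator, as a *nix system does; the helper name read_lines_unix is hypothetical.

```python
def read_lines_unix(data: bytes) -> list:
    """Split file contents into lines, treating only LF as the terminator."""
    lines = data.split(b"\n")
    # A file that ends with LF produces one empty trailing item; drop it.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


windows_file = (b"This is a line of text\r\n"
                b"This is a line of text\r\n"
                b"\r\n"
                b"This is the final line of text\r\n")
unix_file = windows_file.replace(b"\r\n", b"\n")

for name, data in (("file2.txt", unix_file), ("file1.txt", windows_file)):
    print(name)
    for number, line in enumerate(read_lines_unix(data), start=1):
        print(f"  line {number} = {line!r}")

# A comparison that works for the Linux file fails for the Windows one,
# because the CR is still attached to the end of the line.
expected = b"This is a line of text"
print(read_lines_unix(unix_file)[0] == expected)
print(read_lines_unix(windows_file)[0] == expected)
```

A script that has to accept both formats can strip a trailing CR from each line itself (in Python, for instance, with line.rstrip(b"\r")) before comparing.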
Another example shows what happens to a perl script written on Windows (say in Notepad) and transferred to Linux. Perl scripts (like php and shell scripts) are interpreted, meaning each line in the script file is read in turn and its contents parsed to determine what actions to take. The first line (the #! line) is also important in Linux as it tells the system which interpreter to use to execute the script.
    #!/usr/bin/perl<CR><LF>
    use strict;<CR><LF>
    <CR><LF>
    if ((! defined $ARGV[0])||($ARGV[0] eq "")) {<CR><LF>
    print<<_END_;<CR><LF>
    undos [file]: remove dos line breaks from a script<CR><LF>
    _END_<CR><LF>
    exit;<CR><LF>
    }<CR><LF>
When run on Windows, as the perl interpreter reads each line in, the CR and LF are stripped off leaving just the command line, and the script runs fine. The #! line is actually ignored by Windows and just treated as a comment.
When run on Linux, the system reads the first (#!) line and determines that the script should be run by perl. However, because the script is expected to be in *nix text format, the first line actually looks like #!/usr/bin/perl<CR> (perl<carriage return>), so the system tries to find a program called perl<CR>, which does not exist, and the shell (normally bash) fails to run the script.
    username@shellx username $ ./test.pl
    bash: ./test.pl: No such file or directory
Looking at this from the other direction, if you wrote a script on Linux (which only contained LF) and transferred it to Windows, it would still run fine even though each line only contained LF as the line terminator. This is because perl is aware of both Linux and Windows text formats, so it can interpret the lines correctly whether they use LF or CR/LF as the line terminator. The first (#!) line is also ignored.
One solution that can be used with perl is to turn on perl's warning mode by using #!/usr/bin/perl -w. This makes perl a bit stricter about interpreting your code and warns you when you do silly things, which is no bad thing and you should be using it anyway. It works because the CR now follows the -w option instead of following perl. The system looks for perl (without a CR), which it will find, and because perl interprets the -w option itself it will ignore the CR and continue executing the script. For other scripts (shell, php etc) just leaving an extra space after the program name may also stop this situation occurring.
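To make the #! behaviour more concrete, here is a small Python sketch of how a *nix system might split the first line of a script into an interpreter path and an optional argument. It is a simplified model written for this tutorial, not the real implementation, and the name parse_shebang is hypothetical. It assumes that only LF ends the line and that everything after the first space is passed on as a single argument.

```python
def parse_shebang(first_line: bytes):
    """Simplified model: split a '#!' line into (interpreter, argument).

    Only LF is treated as the line terminator, so a CR stays in the text.
    """
    if not first_line.startswith(b"#!"):
        return None
    # Drop the '#!' marker and the LF, then any surrounding spaces or tabs.
    body = first_line[2:].rstrip(b"\n").strip(b" \t")
    # The interpreter path runs up to the first space; the rest is the argument.
    interpreter, _, argument = body.partition(b" ")
    return interpreter, argument.strip(b" \t")


for line in (b"#!/usr/bin/perl\n",        # *nix format
             b"#!/usr/bin/perl\r\n",      # Windows format
             b"#!/usr/bin/perl -w\r\n"):  # Windows format with -w
    print(repr(line), "->", parse_shebang(line))
```

In the second case the CR ends up as part of the interpreter path, which is why no such program can be found. In the third case the interpreter path is clean and the CR is attached to the -w argument instead, where perl deals with it.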
How can you solve this compatibility problem
This is actually very easy to do, or more precisely it can be done for you. The most common method for transferring files between systems is to use FTP (File Transfer Protocol). This is a communication protocol that allows data (files) to be transferred reliably between two systems. FTP programs offer several types or modes of transferring files: Binary, ASCII and sometimes Auto.
- Binary - When a file is transferred, all data in the file is transferred unchanged. The source and destination files will be exact copies of each other.
- ASCII - When a file is transferred, CR/LF conversion will be performed on the file. This means files copied from Windows to Linux will have the CR characters stripped, so the remote copy will be in Linux text format. Files copied from Linux to Windows will have a CR inserted before each LF character, so the destination copy will be in Windows text format.
- Auto - This mode tries to determine if the file to be transferred is text (e.g. a script) or non-text (e.g. pdf, zip or raw data). How it determines this may be program dependent: some programs look at the file contents and others look at the file extension. If it determines the file is text it uses ASCII file transfer, otherwise it uses binary file transfer. Note: this method may not always work and could result in scripts being transferred in binary, so no CR/LF conversion will be performed. Use it with caution.
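The net effect of the three modes can be sketched in a few lines of Python. This is only an illustration of the conversions described above, not how any particular FTP program is implemented. The function names and the TEXT_EXTENSIONS list are assumptions made up for the example, and the extension check stands in for whatever test a real Auto mode uses.

```python
import os

# Hypothetical list of extensions an Auto mode might treat as text.
TEXT_EXTENSIONS = {".txt", ".html", ".htm", ".pl", ".php", ".cgi", ".sh"}


def to_unix(data: bytes) -> bytes:
    """ASCII mode, Windows to Linux: strip the CR from each CR+LF pair."""
    return data.replace(b"\r\n", b"\n")


def to_windows(data: bytes) -> bytes:
    """ASCII mode, Linux to Windows: insert a CR before each LF."""
    # Normalise first so lines that already end in CR+LF are not doubled up.
    return to_unix(data).replace(b"\n", b"\r\n")


def choose_mode(filename: str) -> str:
    """Auto mode by file extension: 'ascii' for known text types, else 'binary'."""
    extension = os.path.splitext(filename)[1].lower()
    return "ascii" if extension in TEXT_EXTENSIONS else "binary"


print(to_unix(b"line one\r\nline two\r\n"))
print(to_windows(b"line one\nline two\n"))
for name in ("index.html", "undos", "archive.zip"):
    print(name, "->", choose_mode(name))
```

Note that a script with no extension, such as undos, falls through to binary in this sketch, which is exactly the kind of case where Auto mode can leave the CR characters in place.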
How can I correct files with the wrong CR/LF on Linux
What follows is a perl script called undos that removes any CR characters from a file. It converts a Windows format text file using CR+LF line terminators to a Linux format text file using LF line terminators.
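    #!/usr/bin/perl -w
    use strict;

    # Show usage if no file name was given.
    if ((! defined $ARGV[0])||($ARGV[0] eq "")) {
    print<<_END_;
    undos [file]: remove dos line breaks from a script
    _END_
    exit;
    }

    # Refuse anything that is not a regular file.
    if (!-f $ARGV[0]) {
    print<<_END_;
    $ARGV[0] is not a file!
    _END_
    exit;
    }

    # Open the file for reading and writing without truncating it.
    # $! holds the system error message if a call fails.
    open (MODIFY, "+<", $ARGV[0]) or die "Opening: $!\n";
    my @file = <MODIFY>;

    # Remove every CR character from every line.
    s/\r//g for @file;

    # Rewind, write the converted lines back, then cut off the leftover bytes.
    seek (MODIFY, 0, 0) or die "Seeking: $!\n";
    print MODIFY @file or die "Printing: $!\n";
    truncate (MODIFY, tell(MODIFY)) or die "Truncating: $!\n";
    close (MODIFY) or die "Closing: $!\n";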
To get the file into your cgi space use the following commands at a cgi $ prompt:
    $ wget http://www.tutorialsteam.plus.com/cgi/crlf/undos.gz
    $ gunzip undos.gz
    $ chmod 705 undos
Alternatively, highlight and copy the perl script text above and write it to a text file on the cgi server. You can do this by creating the file with vi (vi undos), entering insert mode by pressing i, then pasting the copied text into the screen. You then press escape and write the file by typing :wq, then set the file permissions using chmod 705 undos as shown above.
You then run it as follows:
    $ ./undos filename
where filename is the name of the Windows text file you want to convert. The result will be a file of the same name with the CR characters removed.
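For readers who do not have perl to hand, the same idea can be sketched in Python. This is not the undos script itself, just a minimal equivalent written for illustration. Like the perl version it removes every CR byte from the file and overwrites the file in place; the return value (the number of CR bytes removed) is an addition of this sketch.

```python
def undos(path: str) -> int:
    """Remove every CR byte from the file at `path`, in place.

    Returns the number of CR bytes that were removed.
    """
    # Read as bytes so no newline translation happens behind our back.
    with open(path, "rb") as handle:
        original = handle.read()

    converted = original.replace(b"\r", b"")

    # Only rewrite the file if something actually changed.
    if converted != original:
        with open(path, "wb") as handle:
            handle.write(converted)

    return len(original) - len(converted)
```

Calling undos("test.pl"), for example, would convert a script named test.pl and report how many CR characters were stripped. Running it a second time on the same file should report 0, since the file is already in Linux text format.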
----
This version is based on an original document written by Alex Hudson who gave his permission for it to be used here. The undos script was also written by Alex Hudson with a correction by myself.
Original Article by: petervaughan - Edited by: acarr
