一. What is a CSV file?
Files with .csv (Comma Separated Values) extension represent plain text files that contain records of data with comma separated values. Each line in a CSV file is a new record from the set of records contained in the file. Such files are generated when data transfer is intended from one storage system to another. Since all applications can recognize records separated by comma, import of such data files to database is done very conveniently. Almost all spreadsheet applications such as Microsoft Excel or OpenOffice Calc can import CSV without much effort. Data imported from such files is arranged in cells of a spreadsheet for representation to user.
Brief History
Following are some quick facts about the origin and history of CSV file format.
- 1972 - IBM Fortran (level H extended) compiler supported them under OS/360
- 1978 - List-directed input/output was supported by FORTRAN 77 that used commas and spaces for delimiters
- 2005 - CSV was standardized with RFC4180 as a MIME Content Type.
- 2013 - RFC4180’s deficiencies were handled by a W3C recommendation
- 2015 - W3C made the first drafts of recommendations for CSV-metadata standards, that began as recommendation in December 2015
Convert CSV Files
CSV files can be converted to several different file formats using the applications that can open these files. For example, Microsoft Excel can import data from CSV file format and save it to XLS, XLSX, PDF, TXT, XML and HTML file formats. Similarly, other desktop as well as online services provide the capability to export CSV files to HTML, ODS and RTF.
CSV File Format
CSV file format is known to be specified under RFC4180. It defines any file to be CSV compliant if:
- Each record is located on a separate line, delimited by a line break (CRLF). For example:
- aaa,bbb,ccc CRLF
- zzz,yyy,xxx CRLF
- The last record in the file may or may not have an ending line break. For example:
- aaa,bbb,ccc CRLF
- zzz,yyy,xxx
- There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type). For example:
- field_name,field_name,field_name CRLF
- aaa,bbb,ccc CRLF
- zzz,yyy,xxx CRLF
- Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example:
- aaa,bbb,ccc
- Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:\
- "aaa","bbb","ccc" CRLF
- zzz,yyy,xxx
- Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
- "aaa","b CRLF
- bb","ccc" CRLF
- zzz,yyy,xxx
- If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
- "aaa","b""bb","ccc"
However, in light of modern usage, the delimiter is not limited to comma only and can be semicolon, tab or spaces as well. Applications such as Microsoft Excel provide option to specify the delimiter character for importing records from a CSV file.
二. 相比于excel
文件
Excel | CSV |
---|---|
It is a binary file that holds information about all the worksheets in a workbook | CSV stands for Comma Separated Values. It is a plain text format with a series of values separated by commas. |
An Excel not only stores data but can also do operations on the data | A CSV file is just a text file, it stores data but does not contain formatting, formulas, macros, etc. It is also known as flat files |
Excel is a spreadsheet that saves files into its own proprietary format viz xls or xlsx | CSV is a format for saving tabular information into a delimited text file with extension .csv |
Files saved in excel cannot be opened or edited by text editors | CSV files can be opened or edited by text editors like notepad |
In data-warehouse, Excel is preferable for detailed standardized schema specification | In data-warehouse, CSV follows a fairly flat, simple schema |
Any programming language library to parse Excel data is generally larger, slower and complicated | Any programming language to parse CSV data is trivial, generating it is extremely easy |
With no clear distinction or separation between numeric values and text, Excel can mess up with your postal codes and credit card numbers with its auto formatting features | CSV is safe and can clearly differentiate between the numeric values and text. CSV does not manipulate data and stores it as-is. |
In Excel, you have to have a start tag and end tag for each column in each row | In CSV, you write column headers only once |
Excel consumes more memory while importing data | Importing CSV files can be much faster, and it also consumes less memory |
Reading large files user is much easier in Excel for the end user. Also, you can have additional functions like selecting individual cells for import, convert dates and time automatically, reading formulas and their results, filters, sorting, etc. | Reading large files in CSV will not be as easier as Excel for the End User |
Apart from text, data can also be stored in form of charts and graphs | Each record is stored as one line of a text file, and every newline signifies a new database row. CSV can not store charts or graphs |
Excel file can be opened with Microsoft Excel doc only | CSV can be opened with any text editor in Windows like notepad, MS Excel, Microsoft Works 9, etc. |
Excel can connect to external data sources to fetch data. You can use custom add-in in Excel to increase its functionality. Excel allows for Review of Data with detailed tracking and commenting feature. | All this functionality is not possible in CSV |
As a developer, it’s difficult to programmatically manipulate Excel files since the Excel is proprietary. This is especially true for languages other than .NET | As a developer it’s easy to programmatically manipulate CSV since, after all, they are simple text files. |
简而言之对于excel的文件(如xls, xlsx
)
- 开放标准, 兼容性不是问题
- 文件结构简单, 解析速度更快
- 数据密度高(等量数据, 文件更小)
相近的格式还有, 如TSV
(制表符作为分隔符).
三. read_csv
(图片较大, 右键打开)
进一步归纳:
- 基本参数
- 数据转换(日期, 空值, 精度等)
- 数据异常数据处理
- 数据(范围)读取控制(内存控制)
以上处理csv
的基础原则不限于read_csv
, 其他的诸如read_excel
, read_json
等读取数据的方式, 均遵循上述规范. 通过对read_csv
的参数剖析, 可以更好理解其它方式的实现.
详细内容见: NoteBook/Pandas/read_csv_专项 at main · Kyouichirou/NoteBook (github.com)