CSV to ORC Conversion Tool

This script converts a CSV file into ORC files.

Usage

$ python csv_to_orc.py --input <path/to/input.csv> --output_dir <path/to/output/orc/files>

Parameters

--input, -i

Input CSV file path.

--output_dir, -o

Output ORC file directory path.

--blocksize

Block size used when reading the CSV file, for example 64MB or 128MB. The default value is 64MB.

--no-header

Specify this option if the CSV file has no header row.
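
The parameters above could be declared with argparse roughly as follows. This is only a sketch and not the actual source of csv_to_orc.py: the function name build_parser, the help strings, and marking --input and --output_dir as required are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Convert a CSV file to ORC files.")
    # Assumption: both paths are mandatory; the documentation does not say so explicitly.
    parser.add_argument("--input", "-i", required=True, help="Input CSV file path.")
    parser.add_argument("--output_dir", "-o", required=True,
                        help="Output ORC file directory path.")
    # Documented default is 64MB.
    parser.add_argument("--blocksize", default="64MB",
                        help="Block size used when reading the CSV file.")
    # argparse stores this flag as args.no_header (False unless the flag is given).
    parser.add_argument("--no-header", action="store_true",
                        help="Set if the CSV file has no header row.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-i", "data.csv", "-o", "/data/orc/", "--no-header"])
print(args.input, args.output_dir, args.blocksize, args.no_header)
```

Keeping --blocksize as a string matches the documented values such as 64MB and leaves unit parsing to whatever reads the CSV.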

Tool Description

RecIS reads data in the columnar ORC file format, so CSV files must be converted to ORC files before training. The script reads the CSV file with Dask, one block at a time (see --blocksize), to make reading and writing more efficient.
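
A conversion of this kind might look like the sketch below. It is not the script's actual implementation: the function names read_csv_options and convert_csv_to_orc are hypothetical, and it assumes dask and pyarrow are installed. It relies on dask.dataframe.read_csv, which accepts a blocksize such as "64MB" and forwards header=None to pandas, and on the Dask DataFrame to_orc method.

```python
from typing import Any, Dict


def read_csv_options(blocksize: str = "64MB", no_header: bool = False) -> Dict[str, Any]:
    """Translate the CLI options into keyword arguments for read_csv (hypothetical helper)."""
    options: Dict[str, Any] = {"blocksize": blocksize}
    if no_header:
        # header=None tells the reader that the first row is data, not column names.
        options["header"] = None
    return options


def convert_csv_to_orc(input_path: str, output_dir: str,
                       blocksize: str = "64MB", no_header: bool = False) -> None:
    """Read a CSV file in blocks with Dask and write it out as ORC files."""
    # Imported here so the helper above can be used without Dask installed.
    import dask.dataframe as dd

    df = dd.read_csv(input_path, **read_csv_options(blocksize, no_header))
    # Assumption: without a header the columns get integer names, so convert
    # them to strings before writing.
    df.columns = [str(c) for c in df.columns]
    # Writes the ORC output into output_dir; the row index is not stored.
    df.to_orc(output_dir, write_index=False)
```

Calling convert_csv_to_orc("data.csv", "/data/orc/") would correspond to the command-line example in this document, under the assumptions stated above.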

Example

Convert a CSV file named “data.csv” to ORC files in the “/data/orc/” directory:

$ python csv_to_orc.py --input data.csv --output_dir /data/orc/