pyspark.pandas.Series.to_csv

Series.to_csv(path: Optional[str] = None, sep: str = ',', na_rep: str = '', columns: Optional[List[Union[Any, Tuple[Any, ...]]]] = None, header: bool = True, quotechar: str = '"', date_format: Optional[str] = None, escapechar: Optional[str] = None, num_files: Optional[int] = None, mode: str = 'w', partition_cols: Union[str, List[str], None] = None, index_col: Union[str, List[str], None] = None, **options: Any) → Optional[str]

Write object to a comma-separated values (csv) file.
Note
pandas-on-Spark to_csv writes files to a path or URI. Unlike pandas, pandas-on-Spark respects HDFS properties such as 'fs.default.name'.
Note
pandas-on-Spark writes CSV files into a directory, path, and writes multiple part-… files in that directory when path is specified. This behavior is inherited from Apache Spark. The number of partitions can be controlled by num_files, but this is deprecated; use DataFrame.spark.repartition instead, as shown in the sketch below.
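Since num_files is deprecated, the recommended way to control the number of output files is to repartition before writing. A minimal sketch, assuming pyspark.pandas is imported as ps and using a hypothetical output directory:

>>> psdf = ps.DataFrame({'country': ['KR', 'US', 'JP'], 'code': [1, 2, 3]})
>>> # repartition the data into a single Spark partition so only one part file is written
>>> psdf.spark.repartition(1).to_csv(path='/tmp/to_csv/repartitioned')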
- Parameters
- path: str, default None
File path. If None is provided the result is returned as a string.
- sep: str, default ','
String of length 1. Field delimiter for the output file.
- na_rep: str, default ''
Missing data representation.
- columns: sequence, optional
Columns to write.
- header: bool or list of str, default True
Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
- quotechar: str, default '"'
String of length 1. Character used to quote fields.
- date_format: str, default None
Format string for datetime objects.
- escapechar: str, default None
String of length 1. Character used to escape sep and quotechar when appropriate.
- num_files: int, optional
The number of partitions to be written in the path directory when path is specified. This is deprecated; use DataFrame.spark.repartition instead.
- mode: str, default 'w'
Python write mode.
Note
mode can also accept the strings for Spark writing mode, such as 'append', 'overwrite', 'ignore', 'error', or 'errorifexists'; an overwrite example is shown in the Examples section below.
'append' (equivalent to 'a'): Append the new data to existing data.
'overwrite' (equivalent to 'w'): Overwrite existing data.
'ignore': Silently ignore this operation if data already exists.
'error' or 'errorifexists': Throw an exception if data already exists.
- partition_cols: str or list of str, optional, default None
Names of the partitioning columns.
- index_col: str or list of str, optional, default: None
Column names to be used in Spark to represent pandas-on-Spark's index. The index name in pandas-on-Spark is ignored. By default, the index is always lost.
- options: keyword arguments for additional options specific to PySpark.
These kwargs are passed through to PySpark's CSV options; check the available options in PySpark's API documentation for spark.write.csv(…). They take higher priority and override all other options. This parameter only works when path is specified. The last example below sketches how one of these options is passed.
- Returns
- str or None
Examples
>>> df = ps.DataFrame(dict(
...    date=list(pd.date_range('2012-1-1 12:00:00', periods=3, freq='M')),
...    country=['KR', 'US', 'JP'],
...    code=[1, 2, 3]), columns=['date', 'country', 'code'])
>>> df.sort_values(by="date")
                   date country  code
... 2012-01-31 12:00:00      KR     1
... 2012-02-29 12:00:00      US     2
... 2012-03-31 12:00:00      JP     3
>>> print(df.to_csv())
date,country,code
2012-01-31 12:00:00,KR,1
2012-02-29 12:00:00,US,2
2012-03-31 12:00:00,JP,3
>>> df.cummax().to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ps.read_csv(
...    path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date")
                   date country  code
... 2012-01-31 12:00:00      KR     1
... 2012-02-29 12:00:00      US     2
... 2012-03-31 12:00:00      US     3
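The Spark write modes listed under mode can be passed as well. A minimal sketch reusing the directory from the previous example (not part of the original doctest; output omitted):

>>> # replace whatever was previously written to the directory
>>> df.cummax().to_csv(path=r'%s/to_csv/foo.csv' % path, mode='overwrite')
>>> # or add new part files alongside the existing data instead
>>> df.cummax().to_csv(path=r'%s/to_csv/foo.csv' % path, mode='append')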
In the case of a Series,
>>> print(df.date.to_csv())
date
2012-01-31 12:00:00
2012-02-29 12:00:00
2012-03-31 12:00:00
>>> df.date.to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ps.read_csv(
...    path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date")
                   date
... 2012-01-31 12:00:00
... 2012-02-29 12:00:00
... 2012-03-31 12:00:00
You can preserve the index in the roundtrip as below.
>>> df.set_index("country", append=True, inplace=True)
>>> df.date.to_csv(
...     path=r'%s/to_csv/bar.csv' % path,
...     num_files=1,
...     index_col=["index1", "index2"])
>>> ps.read_csv(
...     path=r'%s/to_csv/bar.csv' % path, index_col=["index1", "index2"]
... ).sort_values(by="date")
                             date
index1 index2
...    ...    2012-01-31 12:00:00
...    ...    2012-02-29 12:00:00
...    ...    2012-03-31 12:00:00
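partition_cols and the pass-through PySpark CSV options can be combined in the same call. A minimal sketch using the same temporary path variable as above ('compression' is a standard spark.write.csv option; the output directory name is hypothetical and no output is shown):

>>> # write one subdirectory per value of 'code', compressing each part file with gzip
>>> df.to_csv(
...     path=r'%s/to_csv/partitioned' % path,
...     partition_cols='code',
...     compression='gzip')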