pyspark.pandas.Series.to_csv

Series.to_csv(path: Optional[str] = None, sep: str = ',', na_rep: str = '', columns: Optional[List[Union[Any, Tuple[Any, ...]]]] = None, header: bool = True, quotechar: str = '"', date_format: Optional[str] = None, escapechar: Optional[str] = None, num_files: Optional[int] = None, mode: str = 'w', partition_cols: Union[str, List[str], None] = None, index_col: Union[str, List[str], None] = None, **options: Any) → Optional[str]

Write object to a comma-separated values (csv) file.
Note
pandas-on-Spark to_csv writes files to a path or URI. Unlike pandas, pandas-on-Spark respects HDFS properties such as 'fs.default.name'.
Note
pandas-on-Spark writes CSV files into a directory, path, and writes multiple part-… files in that directory when path is specified. This behavior is inherited from Apache Spark. The number of partitions can be controlled by num_files, but this is deprecated; use DataFrame.spark.repartition instead, as shown in the sketch below.
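Since num_files is deprecated, the recommended way to control the number of output files is to repartition before writing. A minimal sketch, assuming pyspark.pandas is imported as ps and using a hypothetical output directory:

>>> psdf = ps.DataFrame({'country': ['KR', 'US', 'JP'], 'code': [1, 2, 3]})
>>> # repartition the data into a single Spark partition so only one part file is written
>>> psdf.spark.repartition(1).to_csv(path='/tmp/to_csv/repartitioned')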
- Parameters
- path: str, default None
File path. If None is provided the result is returned as a string.
- sep: str, default ','
String of length 1. Field delimiter for the output file.
- na_rep: str, default ''
Missing data representation.
- columns: sequence, optional
Columns to write.
- header: bool or list of str, default True
Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
- quotechar: str, default '"'
String of length 1. Character used to quote fields.
- date_format: str, default None
Format string for datetime objects.
- escapechar: str, default None
String of length 1. Character used to escape sep and quotechar when appropriate.
- num_files: int, optional
The number of partitions to be written in the path directory when path is specified. This is deprecated; use DataFrame.spark.repartition instead.
- mode: str, default 'w'
Python write mode.
Note
mode can also accept the strings for Spark writing mode, such as 'append', 'overwrite', 'ignore', 'error', or 'errorifexists'; an overwrite example is shown in the Examples section below.
'append' (equivalent to 'a'): Append the new data to existing data.
'overwrite' (equivalent to 'w'): Overwrite existing data.
'ignore': Silently ignore this operation if data already exists.
'error' or 'errorifexists': Throw an exception if data already exists.
- partition_cols: str or list of str, optional, default None
Names of the partitioning columns.
- index_col: str or list of str, optional, default: None
Column names to be used in Spark to represent pandas-on-Spark's index. The index name in pandas-on-Spark is ignored. By default, the index is always lost.
- options: keyword arguments for additional options specific to PySpark.
These kwargs are passed through to PySpark's CSV options; check the available options in PySpark's API documentation for spark.write.csv(…). They take higher priority and override all other options. This parameter only works when path is specified. The last example below sketches how one of these options is passed.
- Returns
- str or None
Examples
>>> df = ps.DataFrame(dict(
...    date=list(pd.date_range('2012-1-1 12:00:00', periods=3, freq='M')),
...    country=['KR', 'US', 'JP'],
...    code=[1, 2, 3]), columns=['date', 'country', 'code'])
>>> df.sort_values(by="date")
                   date country  code
... 2012-01-31 12:00:00      KR     1
... 2012-02-29 12:00:00      US     2
... 2012-03-31 12:00:00      JP     3
>>> print(df.to_csv())
date,country,code
2012-01-31 12:00:00,KR,1
2012-02-29 12:00:00,US,2
2012-03-31 12:00:00,JP,3
>>> df.cummax().to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ps.read_csv(
...    path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date")
                   date country  code
... 2012-01-31 12:00:00      KR     1
... 2012-02-29 12:00:00      US     2
... 2012-03-31 12:00:00      US     3
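The Spark write modes listed under mode can be passed as well. A minimal sketch reusing the directory from the previous example (not part of the original doctest; output omitted):

>>> # replace whatever was previously written to the directory
>>> df.cummax().to_csv(path=r'%s/to_csv/foo.csv' % path, mode='overwrite')
>>> # or add new part files alongside the existing data instead
>>> df.cummax().to_csv(path=r'%s/to_csv/foo.csv' % path, mode='append')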
In the case of a Series,
>>> print(df.date.to_csv())
date
2012-01-31 12:00:00
2012-02-29 12:00:00
2012-03-31 12:00:00
>>> df.date.to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
>>> ps.read_csv(
...    path=r'%s/to_csv/foo.csv' % path
... ).sort_values(by="date")
                   date
... 2012-01-31 12:00:00
... 2012-02-29 12:00:00
... 2012-03-31 12:00:00
You can preserve the index in the roundtrip as below.
>>> df.set_index("country", append=True, inplace=True)
>>> df.date.to_csv(
...     path=r'%s/to_csv/bar.csv' % path,
...     num_files=1,
...     index_col=["index1", "index2"])
>>> ps.read_csv(
...     path=r'%s/to_csv/bar.csv' % path, index_col=["index1", "index2"]
... ).sort_values(by="date")
                             date
index1 index2
...    ...    2012-01-31 12:00:00
...    ...    2012-02-29 12:00:00
...    ...    2012-03-31 12:00:00
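partition_cols and the pass-through PySpark CSV options can be combined in the same call. A minimal sketch using the same temporary path variable as above ('compression' is a standard spark.write.csv option; the output directory name is hypothetical and no output is shown):

>>> # write one subdirectory per value of 'code', compressing each part file with gzip
>>> df.to_csv(
...     path=r'%s/to_csv/partitioned' % path,
...     partition_cols='code',
...     compression='gzip')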