Reading Pickle Files in Pandas using read_pickle (2024)

Mokhtar Ebrahim

Last Updated On: October 16, 2023

read_pickle in Pandas allows you to load pickled Pandas objects.

It can load data such as DataFrames and Series that were saved using Pandas to_pickle method.

In this tutorial, we’ll cover its syntax, load pickle files into DataFrames, and benchmark its read performance under different compression algorithms.

Table of Contents

  • 1 Pandas read_pickle Syntax and Parameters
  • 2 Risks of Unpickling Data from Untrusted Sources
  • 3 How to read a pickle file
  • 4 Read Compressed Pickle
  • 5 Read first Row or n Rows from Pickle
  • 6 Benchmark for Different Compression Algorithms
  • 7 Error Handling and Troubleshooting
    • 7.1 Solutions
  • 8 Resource

Pandas read_pickle Syntax and Parameters

The basic syntax for read_pickle is as follows:

pandas.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)
  • filepath_or_buffer: The location of the pickled object. This can be a string path, a path object (such as pathlib.Path), or a file-like object opened in binary mode. A URL is also accepted.
  • compression: The type of compression to use, if any. By default, it’s set to ‘infer’, which means the method will try to infer the compression type from the file extension (for example .gz, .bz2, .xz, or .zip). Supported compression types include ‘bz2’, ‘gzip’, ‘xz’, and ‘zip’; recent Pandas versions also accept ‘zstd’ and ‘tar’. Pass None to read the file without decompression.
  • storage_options: This is a dict parameter which is relevant if you’re using specific storage connection settings, especially when working with remote storage like S3 or GCS.
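As a minimal sketch of the first parameter, the snippet below writes a small DataFrame to a temporary file and reads it back twice: once from a path and once from an open binary file object. The DataFrame contents and the file name example.pkl are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pkl"
    df.to_pickle(path)

    # 1) Pass a path (a str or a pathlib.Path both work).
    from_path = pd.read_pickle(path)

    # 2) Pass a file object that was opened in binary mode.
    with open(path, "rb") as fh:
        from_handle = pd.read_pickle(fh)

# Both calls return an object equal to the original DataFrame.
print(from_path.equals(df), from_handle.equals(df))
```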

Risks of Unpickling Data from Untrusted Sources

It’s crucial to understand the potential dangers associated with unpickling data, especially when the source of that data is untrusted.

  1. Arbitrary Code Execution: The pickled data can contain executable code.
    Unpickling from an untrusted source can run arbitrary code, potentially harming your system or compromising sensitive data.
  2. Denial of Service (DoS): A specially crafted pickled file can cause your application to crash or hang, leading to a Denial of Service attack.

Never unpickle data received from untrusted or unauthenticated sources.
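To make the first risk concrete, here is a deliberately harmless sketch using the standard pickle module. The class name HarmlessPayload is invented for this example. Its __reduce__ method tells the unpickler which callable to invoke on load; here it is os.getcwd, but a malicious file could name any importable function instead.

```python
import os
import pickle


class HarmlessPayload:
    """Stand-in for a malicious object; this one only asks for the working directory."""

    def __reduce__(self):
        # The unpickler will call os.getcwd() with no arguments when loading.
        return (os.getcwd, ())


blob = pickle.dumps(HarmlessPayload())

# Loading the bytes does not rebuild a HarmlessPayload instance.
# It calls os.getcwd() and returns whatever that call returns.
result = pickle.loads(blob)
print(type(result).__name__, result == os.getcwd())
```

The same mechanism applies to read_pickle, since it relies on Python's pickle machinery to rebuild the object.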

How to read a pickle file

You can use the read_pickle function to read a pickle file in Pandas like this:

import pandas as pd

df = pd.read_pickle('sample_data.pkl')
print(df)

Output:

   Name  Age  Salary
0  Alex   25   50000
1  John   30   60000
2  Jane   28   55000

Here, we loaded the DataFrame stored in the “sample_data.pkl” file.

The DataFrame, as shown in the output, has three columns: ‘Name’, ‘Age’, and ‘Salary’ and three entries for demonstration.
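If you don't have such a file yet, the following sketch creates one with to_pickle and reads it back. The values are taken from the output shown above; writing the file to the current working directory is an assumption made for convenience.

```python
import pandas as pd

# Rebuild the three-row DataFrame shown in the output above.
sample = pd.DataFrame(
    {
        "Name": ["Alex", "John", "Jane"],
        "Age": [25, 30, 28],
        "Salary": [50000, 60000, 55000],
    }
)

# Save it as a pickle file in the current working directory.
sample.to_pickle("sample_data.pkl")

# Load it back and display it.
df = pd.read_pickle("sample_data.pkl")
print(df)
```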

Read Compressed Pickle

Pandas natively supports several compression protocols:

  • gzip: An extensively used compression method, particularly suitable for textual data.
  • bz2: Another compression method that often provides a better compression ratio than gzip, albeit at a slightly slower speed.
  • xz: Provides one of the best compression ratios, although it can be much slower than the other methods.
  • zip: Widely known and used, it is also supported by Pandas for both pickling and reading.

Assuming you have pickled and compressed a DataFrame using one of the supported methods, you can read the compressed file directly using read_pickle by specifying the appropriate compression type.

For gzip compression:

df_gzip = pd.read_pickle('dataframe.pkl.gz', compression='gzip')

For bz2 compression:

df_bz2 = pd.read_pickle('dataframe.pkl.bz2', compression='bz2')

For xz compression:

df_xz = pd.read_pickle('dataframe.pkl.xz', compression='xz')

For zip compression:

df_zip = pd.read_pickle('dataframe.pkl.zip', compression='zip')

One of the handy features is that if you omit the compression parameter when calling read_pickle, Pandas will try to infer the compression based on the file extension.
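A small sketch of this inference, assuming a made-up DataFrame and a temporary directory: each file is written and read without passing compression, so both to_pickle and read_pickle rely on the file extension alone.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data for the round trip.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for suffix in (".pkl", ".pkl.gz", ".pkl.bz2", ".pkl.xz", ".pkl.zip"):
        path = Path(tmp) / f"dataframe{suffix}"
        # compression defaults to 'infer' when writing ...
        df.to_pickle(path)
        # ... and when reading, so the extension decides the codec.
        results[suffix] = pd.read_pickle(path).equals(df)

print(results)
```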

Read first Row or n Rows from Pickle

Unlike CSV or other textual formats, pickled files are not designed for partial reading.

Loading a pickle is all or nothing: read_pickle has no parameter for reading only some of the rows, so the whole object is loaded into memory. Once the data is loaded, you can easily access the first row.

Load the Pickle and Access the First Row:

df = pd.read_pickle('dataframe.pkl')
first_row = df.iloc[0]
print(first_row)

Output:

A    1
B    4
Name: 0, dtype: int64

The output showcases the values from the first row of our sample DataFrame. Here, we’ve used the iloc property of the DataFrame.

Alternative Method using head():

Pandas DataFrames have a built-in method called head(), which returns the first n rows of the DataFrame.

first_row_with_head = df.head(1)
print(first_row_with_head)

Output:

   A  B
0  1  4

If you want to retrieve the first 10 rows, you'll use df.head(10).

Output (for n=3, as an example):

   A  B
0  1  4
1  2  5
2  3  6

Benchmark for Different Compression Algorithms

The following Python code creates sample pickle files with different compressions and benchmarks the reading times using read_pickle:

import pandas as pd
import time

data = {'A': range(1, 100001), 'B': range(100001, 1, -1)}
df = pd.DataFrame(data)

# Pickle with different compressions
df.to_pickle("dataframe.pkl")  # No compression
df.to_pickle("dataframe_gzip.pkl.gz", compression='gzip')
df.to_pickle("dataframe_bz2.pkl.bz2", compression='bz2')
df.to_pickle("dataframe_xz.pkl.xz", compression='xz')
df.to_pickle("dataframe_zip.pkl.zip", compression='zip')

# Measure load times
files = ["dataframe.pkl", "dataframe_gzip.pkl.gz", "dataframe_bz2.pkl.bz2", "dataframe_xz.pkl.xz", "dataframe_zip.pkl.zip"]
compression_methods = ["No Compression", "gzip", "bz2", "xz", "zip"]
for file, method in zip(files, compression_methods):
    start_time = time.time()
    _ = pd.read_pickle(file)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Reading time with {method}: {elapsed_time:.4f} seconds")

Output:

Reading time with No Compression: 0.0932 seconds
Reading time with gzip: 0.7555 seconds
Reading time with bz2: 4.9183 seconds
Reading time with xz: 2.1486 seconds
Reading time with zip: 0.7317 seconds

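As you can see, the uncompressed file is by far the fastest to read. Among the compressed formats, zip and gzip are the fastest at roughly 0.73 and 0.76 seconds.

The slowest in this run is bz2 (4.92 seconds), followed by xz (2.15 seconds). The xz file, however, was the smallest in size when the pickle files were created.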
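The size comparison is not shown in the benchmark itself. One way to check it, sketched here with the same sample data but written to a temporary directory (an assumption, to avoid leaving files behind), is to compare the sizes reported by os.path.getsize. Actual numbers depend on your data and library versions, so none are listed.

```python
import os
import tempfile

import pandas as pd

# Same sample data as in the benchmark above.
df = pd.DataFrame({"A": range(1, 100001), "B": range(100001, 1, -1)})

# Label -> file name; the extension selects the compression on write.
targets = {
    "No Compression": "dataframe.pkl",
    "gzip": "dataframe.pkl.gz",
    "bz2": "dataframe.pkl.bz2",
    "xz": "dataframe.pkl.xz",
    "zip": "dataframe.pkl.zip",
}

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    for method, name in targets.items():
        path = os.path.join(tmp, name)
        df.to_pickle(path)
        sizes[method] = os.path.getsize(path)  # size in bytes

# Print from smallest to largest file.
for method, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{method}: {size / 1024:.1f} KiB")
```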

Error Handling and Troubleshooting

A common error when unpickling in Pandas is the unsupported pickle protocol error.

It arises when the file was written with a pickle protocol that is newer than the Python version reading it supports. Protocol 5, for example, was introduced in Python 3.8.

The error message looks something like: ValueError: unsupported pickle protocol: 5.

Solutions

Upgrade Python: If the error is due to an older Python version, consider upgrading to a newer one that supports the required protocol.

Re-Pickle with a Lower Protocol: If you have access to the environment where the data was originally pickled, you can re-pickle it specifying a lower protocol. For example:

df.to_pickle("dataframe_lower_protocol.pkl", protocol=4)

Use a Virtual Environment: If you need to maintain multiple Python versions or library versions, consider using tools like venv or conda to create isolated environments.
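The sketch below shows one way to surface this error with a clearer message. The helper name read_pickle_with_hint is hypothetical, and matching on the text of the ValueError is an assumption based on the message quoted above rather than a documented interface.

```python
import pickle
import sys
import tempfile
from pathlib import Path

import pandas as pd


def read_pickle_with_hint(path):
    """Hypothetical helper: load a pickle and explain protocol mismatches."""
    try:
        return pd.read_pickle(path)
    except ValueError as exc:
        # Only rewrap the protocol error; other ValueErrors pass through unchanged.
        if "unsupported pickle protocol" in str(exc):
            raise RuntimeError(
                f"{path} needs a newer pickle protocol than this interpreter "
                f"supports (highest here: {pickle.HIGHEST_PROTOCOL}, Python "
                f"{sys.version_info.major}.{sys.version_info.minor}). "
                "Upgrade Python or re-pickle the data with a lower protocol."
            ) from exc
        raise


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataframe_lower_protocol.pkl"
    # Protocol 4 is readable by every Python version from 3.4 onward.
    pd.DataFrame({"A": [1, 2, 3]}).to_pickle(path, protocol=4)
    loaded = read_pickle_with_hint(path)

print(loaded.shape)
```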

General Troubleshooting Tips

Always check the versions of Python and Pandas when facing such issues. This can be done using:

import sys
print(sys.version)

and

print(pd.__version__)

Resource

https://pandas.pydata.org/docs/reference/api/pandas.read_pickle.html




FAQs

How to read pickle files in Pandas?

Reading Pickle Files Using Pandas

You can read pickle files with the Pandas read_pickle() function. It takes the path of the pickle file as an argument and returns the object stored in it, which is a DataFrame if a DataFrame was pickled.

How do I see the contents of a pickle file?

Command line usage

When invoked from the command line, python -m pickletools will disassemble the contents of one or more pickle files. Note that if you want to see the Python object stored in the pickle rather than the details of pickle format, you may want to use -m pickle instead.
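The same disassembly is available from Python code through pickletools.dis. The sketch below pickles a made-up dictionary and prints its opcode listing; dis only decodes the opcodes and does not execute them.

```python
import io
import pickle
import pickletools

# Hypothetical object, pickled in memory for this illustration.
data = pickle.dumps({"name": "Alex", "age": 25}, protocol=4)

# Write the opcode listing into a string buffer instead of stdout.
out = io.StringIO()
pickletools.dis(data, out=out)
listing = out.getvalue()

print(listing)
```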

How to read model pickle file in Python?

To load a saved model from a Pickle file, all you need to do is pass the “pickled” model into the Pickle load() function and it will be deserialized. By assigning this back to a model object, you can then run your original model's predict() function, pass in some test data and get back an array of predictions.

How to view .pkl file?

If you cannot open your PKL file correctly, try to right-click or long-press the file. Then click "Open with" and choose an application.

What is the difference between Pandas read pickle and CSV?

Pickle is a serialized way of storing a Pandas DataFrame. Basically, you are writing the exact representation of the DataFrame to disk. This means the column types and the indices are preserved exactly. If you simply save a file as CSV, you are just storing it as comma-separated text, so type information is not kept.
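A short sketch of that difference, using a made-up DataFrame with a datetime column: after a pickle round trip the dtypes match the original, while read_csv with default settings does not parse the column back into datetimes.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data with a datetime column and a float column.
df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
        "value": [1.5, 2.5],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "data.pkl"
    csv_path = Path(tmp) / "data.csv"

    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)  # no parse_dates, so dates stay as text

print(from_pickle.dtypes)
print(from_csv.dtypes)
```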

How do I load a pickle file?

Pickling with a File

In this example, we will first write the data to a pickle file using the pickle.dump() function. Then, using the pickle.load() function, we will load the pickle file in a Python script and print its data in the form of a Python dictionary.

Which of the following methods is used to read data from a pickle file?

Similarly, load() reads pickled objects from a file, whereas loads() deserializes them from a bytes-like object. In this tutorial, we will be using the dump() and load() functions to pickle Python objects to a file and unpickle them.
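The two pairs of functions can be sketched as follows. The dictionary and the file name record.pkl are invented for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical object to serialize.
record = {"name": "Alex", "age": 25}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "record.pkl"

    # dump() writes the pickled object to a file opened in binary write mode.
    with open(path, "wb") as fh:
        pickle.dump(record, fh)

    # load() reads it back from a file opened in binary read mode.
    with open(path, "rb") as fh:
        from_file = pickle.load(fh)

# dumps() and loads() do the same with a bytes object instead of a file.
as_bytes = pickle.dumps(record)
from_bytes = pickle.loads(as_bytes)

print(from_file == record, from_bytes == record)
```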

How to check if a pickle file exists in Python?

We use the is_file() function, which is part of the Path class from the pathlib module, or the exists() function, which is part of the os.path module, to check whether a file exists in Python.
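A minimal sketch of both checks, wrapped around a load. The helper name load_if_present and the file names are hypothetical.

```python
import os.path
import pickle
import tempfile
from pathlib import Path


def load_if_present(path):
    """Hypothetical helper: return the unpickled object, or None if the file is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    with path.open("rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "settings.pkl"
    existing.write_bytes(pickle.dumps({"theme": "dark"}))

    found = load_if_present(existing)
    missing = load_if_present(Path(tmp) / "no_such_file.pkl")

    # The os.path variant of the same existence check.
    exists_via_os = os.path.exists(existing)

print(found, missing, exists_via_os)
```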

How to use pickle in Python?

To use pickle in Python, we must first create an object like my_object. Then, using the dump() function and the 'wb' mode on open(), we can serialize this object to a file called my_object.pickle for future usage. Binary write mode is required because pickle produces bytes rather than text.

What is a .pickle file?

Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.

What is the difference between pickle and marshal in Python?

The pickle module differs from marshal in several significant ways: The pickle module keeps track of the objects it has already serialized, so that later references to the same object won't be serialized again. marshal doesn't do this. This has implications both for recursive objects and object sharing.

How to open pickle file in pandas?

Use the read_pickle() method in Pandas. Its first argument is the file path from which the pickled object will be loaded. The compression argument allows on-the-fly decompression of on-disk data: if it is 'infer', gzip, bz2, xz, or zip is chosen based on the file extension of the path.

How to parse a PKL file?

  1. .pkl files are read with Python.
  2. The pickle module is part of the standard library, so nothing needs to be installed. Open the .pkl file in binary mode.
  3. import pickle
  4. with open('filename.pkl', 'rb') as file:
  5.     my_object = pickle.load(file)
  6. print(my_object)

What is the extension of a Python pickle file?

Python pickle files may have the extension ".pickle" or ".pkl".

How to read ODS file using Pandas?

read_excel() returns a new DataFrame that contains the values from data.xlsx. You can also use read_excel() with OpenDocument spreadsheets, or .ods files.

How do I read a zip file in Pandas?

  1. Read zip files without going through the whole process of unzipping before reading.
  2. Write files into a compressed format, rather than saving them in memory before compression.
  • Specify the compression format when reading a single zip file.
  • Use the ZipFile class when reading a file from a zip folder.

How to read a binary file using pickle?

Reading all records of binary file using pickle module

To read every record, call pickle.load() repeatedly on the same file object. Each call returns the next record, which can then be displayed on the screen. Once the end of the file is reached, load() raises an EOFError exception.
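A sketch of such a loop is shown below. The function body, the file name records.bin, and the sample records are assumptions, since the original read_records() code is not shown here.

```python
import pickle
import tempfile
from pathlib import Path


def read_records(path):
    """Return every object pickled into the file, in the order it was written."""
    records = []
    with open(path, "rb") as fh:
        while True:
            try:
                records.append(pickle.load(fh))
            except EOFError:
                # No more pickled objects in the file.
                break
    return records


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "records.bin"

    # Write three hypothetical records one after another into the same file.
    with open(path, "wb") as fh:
        for record in ({"id": 1}, {"id": 2}, {"id": 3}):
            pickle.dump(record, fh)

    records = read_records(path)

print(records)
```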
