%load_ext pretty_jupyter

Introduction - Using Packages

This walkthrough is part of the ECLR page.

In Python, the concept of packages and libraries is central to its versatility and power. A library in Python is a collection of pre-written functions and tools that extend Python's basic functionality, while a package is essentially a collection of related libraries (python modules) bundled together. Python’s open-source nature means that developers worldwide contribute to its ecosystem, resulting in thousands of packages tailored for various tasks. Whether you're performing data manipulation, creating stunning visualizations, or building machine learning models, Python has a library for it. Some popular examples include:

  • pandas: for data manipulation and analysis.
  • matplotlib and lets_plot: for creating flexible data visualizations.
  • scikit-learn: for machine learning.
  • NumPy: for numerical computing.

To use these functions you will have to do the following: identify which package will help you with your project, then find the package and download it to your computer. All of this you only have to do once. You will, however, have to load the package every time you wish to actually use it in a project.

Which package and where to find it

When you need to perform a specific task in Python, the first step is to identify the library or libraries best suited for the job. For instance, if you're handling tabular data, a quick search like "Python library for data manipulation" will lead you to pandas. Moreover, if you're looking to visualize data, matplotlib, seaborn, lets_plot, or plotly are excellent options. On the other hand, for advanced statistical modeling, libraries like statsmodels or scipy will often be recommended.

Python’s ecosystem is extensively documented, with platforms such as the Python Package Index (PyPI) and GitHub serving as excellent resources for exploring and downloading libraries. Notably, PyPI is the official distribution repository for Python packages, enabling developers to share and consume packages with ease. This repository is a rich source of resources, hosting over a quarter of a million unique packages contributed by developers worldwide. These contributions represent a wide variety of tools, scripts, and codebases, each enhancing Python’s functionality. By selecting and installing the appropriate package, you can significantly expand Python's capabilities to meet your specific needs.

Download and install packages

Most Python packages are hosted on PyPI and can be installed easily using the pip package manager, which comes bundled with Python installations. If you’re using a Jupyter Notebook, you can even run pip commands directly from within your notebook using the ! prefix. For instance, to install the pandas library:

!pip install pandas

However, as you only need to install a package once, your notebook (where you save your workflow) isn't really the best place to do it. You can also install packages via the Terminal. The terminal is a tool through which you type instructions to your computer (as opposed to Python); the installation is then carried out by your computer rather than by Python. Let's say we need to install the packages pandas and numpy as we will need them later.

Open the terminal by clicking on the Terminal item in the menu and choosing New Terminal. It will open at the bottom of your screen, as indicated in this screenshot.

VS Code Terminal

In the terminal you type pip install followed by the name of the package you wish to install, here pandas. Then press Enter and the computer will run the installation. When you see the prompt again (here C:\Pycode\ECLR) and there is no error message, you are good to go. Repeat this for all the packages you need, as shown below.
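
As a small sketch of what this looks like, the following terminal command installs both packages in one go (the prompt shown here matches the screenshot above; yours will show your own folder):

C:\Pycode\ECLR> pip install pandas numpy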

Upgrading and uninstalling packages

If you want to upgrade an existing package to its latest version, you can use:

pip install --upgrade package_name

To uninstall a package, that is, to remove a package from your environment, you can use:

pip uninstall package_name

Also, you can see what packages you have previously installed by entering pip list into the command line:

pip list
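
For instance, to upgrade pandas to its latest version and then check which version is now installed, you could run the following in the terminal (pip show is a standard pip command that prints a package's version and location):

pip install --upgrade pandas
pip show pandas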

Another good place to find Python packages is the Anaconda repository. Anaconda is a free and open-source distribution of the Python programming language, designed with a focus on data science, large-scale data processing, and machine learning applications. Unlike the standard Python distribution, Anaconda comes with a range of built-in modules tailored for these specific domains. It also uses a specialized package manager called Conda, which manages package installations and dependencies efficiently, further enhancing its utility for data-intensive tasks.

anaconda.JPG

To install a package you can also use the conda command instead of pip:

conda install package_name
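
If you run this from inside a Jupyter Notebook rather than a terminal, it needs the same ! prefix as pip. As a minimal sketch (assuming you have Anaconda installed, here installing numpy as an example), the -y flag is a standard conda option that accepts the confirmation prompt automatically so the cell does not hang waiting for input:

!conda install -y numpy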

Loading Libraries

Once a library is installed, you’ll need to load it into your notebook using the import statement. It’s a good practice to load all the libraries you’ll use at the beginning of your notebook to keep things organized. For instance:

import pandas as pd  # for data manipulation
import matplotlib.pyplot as plt  # for data visualization
import numpy as np  # for numerical computations
import statistics as stats # for statistics

If you haven't installed one of the above packages, say the pandas package, then you will get an error message like this:

ModuleNotFoundError: No module named 'pandas'

Now that you know how to install and load the packages you need for your project, it’s equally important to know how to access information about them. You can do this in several ways:

  • Use the help() function to view detailed documentation.
  • Use the ? to access help for a function.
  • Check the official documentation for Python libraries online. It is best to use your internet search engine, e.g. search for "Python help with pandas read_csv function" and you will find useful links with examples of the function being used.
  • Take advantage of Tab completion in Jupyter Notebook to see available methods and attributes. As you write a function your IDE will offer advice on what inputs are expected.

Examples include:

Use the help() function to view detailed documentation.

# Get help for the DataFrame function in pandas
help(pd.DataFrame)

Use the ? or ?? syntax in Jupyter Notebook for quick insights or to explore the source code.

# Inline help - both lines work
#pd.DataFrame?
?pd.DataFrame
Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
)
Docstring:     
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If
    data is a dict, column order follows insertion-order. If a dict contains Series
    which have an index defined, it is aligned by its index.

    .. versionchanged:: 0.25.0
       If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame when data does not have them,
    defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
    will perform column selection instead.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
    Copy data from inputs.
    For dict data, the default of None behaves like ``copy=True``.  For DataFrame
    or 2d ndarray input, the default of None behaves like ``copy=False``.

    .. versionchanged:: 1.3.0

See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.

Examples
--------
Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from a dictionary including Series:

>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Constructing DataFrame from a numpy ndarray that has labeled columns:

>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
   c  a
0  3  1
1  6  4
2  9  7

Constructing DataFrame from dataclass:

>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
   x  y
0  0  0
1  0  3
2  2  3
File:           c:\users\msassrb2\anaconda3\lib\site-packages\pandas\core\frame.py
Type:           type
Subclasses:     SubclassedDataFrame

Alternative importing

Above, the libraries were imported with a command like import pandas as pd. This makes all the functions of the pandas package available to you as the user, as long as you address each function by its name with the pd. prefix, e.g. pd.read_csv().

The reason we do this is that there may be multiple libraries which have functions of the same name, say mean. For instance, the numpy package has a mean function, but there are other libraries which also have a mean function, for instance the statistics library.

help(np.mean)
Help on function mean in module numpy:

mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>)
    Compute the arithmetic mean along the specified axis.
    
    Returns the average of the array elements.  The average is taken over
    the flattened array by default, otherwise over the specified axis.
    `float64` intermediate and return values are used for integer inputs.
    
    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the means are computed. The default is to
        compute the mean of the flattened array.
    
        .. versionadded:: 1.7.0
    
        If this is a tuple of ints, a mean is performed over multiple axes,
        instead of a single axis or all the axes as before.
    dtype : data-type, optional
        Type to use in computing the mean.  For integer inputs, the default
        is `float64`; for floating point inputs, it is the same as the
        input dtype.
    out : ndarray, optional
        Alternate output array in which to place the result.  The default
        is ``None``; if provided, it must have the same shape as the
        expected output, but the type will be cast if necessary.
        See :ref:`ufuncs-output-type` for more details.
    
    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
        the result will broadcast correctly against the input array.
    
        If the default value is passed, then `keepdims` will not be
        passed through to the `mean` method of sub-classes of
        `ndarray`, however any non-default value will be.  If the
        sub-class' method does not implement `keepdims` any
        exceptions will be raised.
    
    where : array_like of bool, optional
        Elements to include in the mean. See `~numpy.ufunc.reduce` for details.
    
        .. versionadded:: 1.20.0
    
    Returns
    -------
    m : ndarray, see dtype parameter above
        If `out=None`, returns a new array containing the mean values,
        otherwise a reference to the output array is returned.
    
    See Also
    --------
    average : Weighted average
    std, var, nanmean, nanstd, nanvar
    
    Notes
    -----
    The arithmetic mean is the sum of the elements along the axis divided
    by the number of elements.
    
    Note that for floating-point input, the mean is computed using the
    same precision the input has.  Depending on the input data, this can
    cause the results to be inaccurate, especially for `float32` (see
    example below).  Specifying a higher-precision accumulator using the
    `dtype` keyword can alleviate this issue.
    
    By default, `float16` results are computed using `float32` intermediates
    for extra precision.
    
    Examples
    --------
    >>> a = np.array([[1, 2], [3, 4]])
    >>> np.mean(a)
    2.5
    >>> np.mean(a, axis=0)
    array([2., 3.])
    >>> np.mean(a, axis=1)
    array([1.5, 3.5])
    
    In single precision, `mean` can be inaccurate:
    
    >>> a = np.zeros((2, 512*512), dtype=np.float32)
    >>> a[0, :] = 1.0
    >>> a[1, :] = 0.1
    >>> np.mean(a)
    0.54999924
    
    Computing the mean in float64 is more accurate:
    
    >>> np.mean(a, dtype=np.float64)
    0.55000000074505806 # may vary
    
    Specifying a where argument:
    >>> a = np.array([[5, 9, 13], [14, 10, 12], [11, 15, 19]])
    >>> np.mean(a)
    12.0
    >>> np.mean(a, where=[[True], [False], [False]])
    9.0

help(stats.mean)
Help on function mean in module statistics:

mean(data)
    Return the sample arithmetic mean of data.
    
    >>> mean([1, 2, 3, 4, 4])
    2.8
    
    >>> from fractions import Fraction as F
    >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
    Fraction(13, 21)
    
    >>> from decimal import Decimal as D
    >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
    Decimal('0.5625')
    
    If ``data`` is empty, StatisticsError will be raised.

If you are using both libraries, Python will only know which mean function you are referring to if you import them in the way shown above. A short prefix is also useful for you and for readers of your code, as it makes clear which library a function comes from.
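
As a small illustration of how the prefixes disambiguate the two functions, here is a minimal sketch using the imports from the beginning of this walkthrough:

data = [1, 2, 3, 4, 4]
print(np.mean(data))     # the numpy mean function
print(stats.mean(data))  # the statistics mean function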

Sometimes there are libraries where a naming conflict is very unlikely. One such library is the plotting library lets_plot, as it basically has only one (but very powerful, as you can see here) function, ggplot. In that case you can import the library as follows:

from lets_plot import *

This means that you can access the functions of that library directly, without any prefix. But to do this you have to be certain that importing the library does not create a naming conflict. It should be treated as an exception.
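
As a minimal sketch of what this unprefixed usage looks like (assuming lets_plot is installed; in a Jupyter Notebook the library is typically initialised with LetsPlot.setup_html() before plotting):

from lets_plot import *

LetsPlot.setup_html()  # set up rendering inside the notebook

# a small scatter plot, calling ggplot, aes and geom_point without any prefix
ggplot({'x': [1, 2, 3, 4], 'y': [2, 4, 3, 5]}, aes(x='x', y='y')) + geom_point()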

Summary

Using libraries and packages is an essential skill for any Python user. Most of the functionality you will use as a data analyst, data scientist, or econometrician comes from Python's extensive library ecosystem. The majority of packages can be installed from PyPI, Python’s official package repository, using the pip package manager. Occasionally, you may need to install a package from GitHub or another source, in which case specific installation instructions are usually provided.

By mastering how to install, load, and explore Python libraries, you unlock the full potential of Python's capabilities for your projects. Whether you're performing data manipulation, building machine learning models, or visualizing complex datasets, Python’s libraries are your most powerful tools.

This walkthrough is part of the ECLR page.