Introduction - Using Packages¶
This walkthrough is part of the ECLR page.
In Python, the concept of packages and libraries is central to its versatility and power. A library in Python is a collection of pre-written functions and tools that extend Python's basic functionality, while a package is essentially a collection of related Python modules bundled together. Python’s open-source nature means that developers worldwide contribute to its ecosystem, resulting in thousands of packages tailored for various tasks. Whether you're performing data manipulation, creating stunning visualizations, or building machine learning models, Python has a library for it. Some popular examples include:
- pandas: for data manipulation and analysis.
- matplotlib and lets_plot: for creating flexible data visualizations.
- scikit-learn: for machine learning.
- NumPy: for numerical computing.
To use these functions you will have to do the following: identify which package will help you in your project, find the package, and download and install it on your computer. All of this you only have to do once. After that, you will have to load the package every time you wish to use it in a project.
Which package and where to find it¶
When you need to perform a specific task in Python, the first step is to identify the library or libraries best suited for the job. For instance, if you're handling tabular data, a quick search like "Python library for data manipulation" will lead you to pandas. Moreover, if you're looking to visualize data, matplotlib, seaborn, lets_plot, or plotly are excellent options. On the other hand, for advanced statistical modeling, libraries like statsmodels or scipy will often be recommended.
Python’s ecosystem is extensively documented, with platforms such as the Python Package Index (PyPI) and GitHub serving as excellent resources for exploring and downloading libraries. Notably, PyPI is the official distribution repository for Python packages, enabling developers to share and consume packages with ease. It hosts over a quarter of a million unique packages contributed by developers worldwide, covering a wide variety of tools, scripts, and codebases, each enhancing Python’s functionality. By selecting and installing the appropriate package, you can significantly expand Python's capabilities to meet your specific needs.
Download and install packages¶
Most Python packages are hosted on PyPI and can be installed easily using the pip package manager, which comes bundled with Python installations. If you’re using a Jupyter Notebook, you can even run pip commands directly from within your notebook using the ! prefix. For instance, to install the pandas library:
!pip install pandas
However, as you only need to install a package once, your notebook (where you want to save your workflow) isn't really the best place to do that. You can also install packages via the terminal. The terminal is a tool through which you type instructions to your computer (as opposed to Python), so the installation is carried out by your computer and not by Python. Let's say we need to install the packages pandas and numpy, as we need them later.
Open the terminal by clicking on the Terminal item in the menu and choosing New Terminal. This will open at the bottom of your screen as indicated in this screenshot.
In the terminal you type pip install and then the name of the package you wish to install, here pandas. Then press Enter and the computer will do its stuff. When you see the prompt again (here C:\Pycode\ECLR) and there is no error message, you are good to go. Repeat that for all the packages needed.
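Once the prompt has returned, one way to double-check the result from inside Python is to ask whether each package can be located. The following is a minimal sketch using the standard library's importlib.util.find_spec; the helper name is_installed and the list of names are illustrative choices, not part of the walkthrough.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this import name."""
    return find_spec(package_name) is not None


# Import names to check; adjust this (illustrative) list to your project
for name in ["pandas", "numpy"]:
    if is_installed(name):
        print(f"{name}: installed")
    else:
        print(f"{name}: missing, run 'pip install {name}' in the terminal")
```

The check only looks the package up; it does not load it, so it is quick even for large libraries.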
Upgrading and uninstalling packages¶
If you want to upgrade an existing package to its latest version, then you can use:
pip install --upgrade package_name
To uninstall a package, that is, to remove it from your environment, you can use:
pip uninstall package_name
Also, you can see which packages you have previously installed by entering pip list into the command line:
pip list
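If you only care about one or two packages, for example to compare the version before and after an upgrade, the same information can be read from within Python. This is a sketch that assumes Python 3.8 or later, where importlib.metadata is part of the standard library; the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version as a string, or None if it is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Names as you would pass them to pip install (illustrative list)
for name in ["pandas", "numpy"]:
    print(name, installed_version(name))
```

A result of None means the package is not installed in the environment that this Python session is using.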
Another good place to find Python packages is the Anaconda repository. Anaconda is a free and open-source distribution of the Python programming language, designed with a focus on data science, large-scale data processing, and machine learning applications. Unlike the standard Python distribution, Anaconda comes with a range of pre-installed packages tailored for these specific domains. It also uses its own package manager called Conda, which manages package installations and dependencies efficiently, further enhancing its utility for data-intensive tasks.
To install a package you can also run the conda command in your Jupyter Notebook file:
conda install package_name
Loading Libraries¶
Once a library is installed, you’ll need to load it into your notebook using the import statement. It’s a good practice to load all the libraries you’ll use at the beginning of your notebook to keep things organized. For instance:
import pandas as pd # for data manipulation
import matplotlib.pyplot as plt # for data visualization
import numpy as np # for numerical computations
import statistics as stats # for statistics
If you haven't installed some of the above packages, let's say the pandas package, then you will get an error message like this:
ModuleNotFoundError: No module named 'pandas'
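If you would rather see a hint than a raw error, the import can be wrapped in a try/except block. The sketch below is one possible way to do that; the helper name try_import is hypothetical, and the suggested pip command assumes that the import name and the name on PyPI are the same, which is not always true (scikit-learn, for example, is imported as sklearn).

```python
import importlib


def try_import(module_name):
    """Import a module by name; print a hint and return None if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        # err.name holds the name of the module that could not be found
        print(f"No module named '{err.name}'. Try: pip install {err.name}")
        return None


# pd is the pandas module if it is installed, otherwise None
pd = try_import("pandas")
```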
Now that you know how to install and load the packages you need for your project, it’s equally important to know how to access information about them. You can do this in several ways:
- Use the help() function to view detailed documentation.
- Use the ? to access help for a function.
- Check the official documentation for Python libraries online. It is best to use your internet search engine, e.g. search for "Python help with pandas read_csv function" and you will find useful links with examples of the function being used.
- Take advantage of Tab completion in Jupyter Notebook to see available methods and attributes. As you write a function your IDE will offer advice on what inputs are expected.
Examples include:
Use the help() function to view detailed documentation.
# Get help for the DataFrame class in pandas
help(pd.DataFrame)
Use the ? or ?? syntax in Jupyter Notebook for quick insights or to explore the source code.
# Inline help - both lines work
#pd.DataFrame?
?pd.DataFrame
Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
)
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If
    data is a dict, column order follows insertion-order. If a dict contains Series
    which have an index defined, it is aligned by its index.

    .. versionchanged:: 0.25.0
       If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame when data does not have them,
    defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
    will perform column selection instead.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
    Copy data from inputs.
    For dict data, the default of None behaves like ``copy=True``. For DataFrame
    or 2d ndarray input, the default of None behaves like ``copy=False``.

    .. versionchanged:: 1.3.0

See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.

Examples
--------
Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from a dictionary including Series:

>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Constructing DataFrame from a numpy ndarray that has labeled columns:

>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
   c  a
0  3  1
1  6  4
2  9  7

Constructing DataFrame from dataclass:

>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
   x  y
0  0  0
1  0  3
2  2  3
File:           c:\users\msassrb2\anaconda3\lib\site-packages\pandas\core\frame.py
Type:           type
Subclasses:     SubclassedDataFrame
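The ? syntax only works in Jupyter and IPython. In a plain Python script, similar information can be pulled out with the standard library's inspect module and the built-in dir() function. The following sketch uses the statistics library, which ships with Python, so that it runs without installing anything; the variable names summary and public_names are illustrative.

```python
import inspect
import statistics as stats

# Signature: which inputs does the function expect?
print(inspect.signature(stats.mean))

# First line of the docstring, usually a one-sentence summary
summary = inspect.getdoc(stats.mean).splitlines()[0]
print(summary)

# Public names offered by the library, similar to what Tab completion lists
public_names = [name for name in dir(stats) if not name.startswith("_")]
print(public_names[:5])
```

The same calls work on any imported library or function, for instance on pd.DataFrame once pandas has been imported.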
Alternative importing¶
Above, the libraries were imported with a command like import pandas as pd. This makes all the functions of the pandas package available to you, as long as you address each function by its name with the pd. prefix, e.g. pd.read_csv().
The reason why we do this is that there may be multiple libraries which have functions of the same name, say mean. For instance, the numpy package has a mean function, but there are other libraries which also have a mean function, for instance the statistics library.
help(np.mean)
Help on function mean in module numpy:

mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>)
    Compute the arithmetic mean along the specified axis.

    Returns the average of the array elements. The average is taken over
    the flattened array by default, otherwise over the specified axis.
    `float64` intermediate and return values are used for integer inputs.

    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the means are computed. The default is to
        compute the mean of the flattened array.

        .. versionadded:: 1.7.0

        If this is a tuple of ints, a mean is performed over multiple axes,
        instead of a single axis or all the axes as before.
    dtype : data-type, optional
        Type to use in computing the mean. For integer inputs, the default
        is `float64`; for floating point inputs, it is the same as the
        input dtype.
    out : ndarray, optional
        Alternate output array in which to place the result. The default
        is ``None``; if provided, it must have the same shape as the
        expected output, but the type will be cast if necessary.
        See :ref:`ufuncs-output-type` for more details.
    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
        the result will broadcast correctly against the input array.

        If the default value is passed, then `keepdims` will not be
        passed through to the `mean` method of sub-classes of
        `ndarray`, however any non-default value will be. If the
        sub-class' method does not implement `keepdims` any
        exceptions will be raised.
    where : array_like of bool, optional
        Elements to include in the mean. See `~numpy.ufunc.reduce` for details.

        .. versionadded:: 1.20.0

    Returns
    -------
    m : ndarray, see dtype parameter above
        If `out=None`, returns a new array containing the mean values,
        otherwise a reference to the output array is returned.

    See Also
    --------
    average : Weighted average
    std, var, nanmean, nanstd, nanvar

    Notes
    -----
    The arithmetic mean is the sum of the elements along the axis divided
    by the number of elements.

    Note that for floating-point input, the mean is computed using the
    same precision the input has. Depending on the input data, this can
    cause the results to be inaccurate, especially for `float32` (see
    example below). Specifying a higher-precision accumulator using the
    `dtype` keyword can alleviate this issue.

    By default, `float16` results are computed using `float32` intermediates
    for extra precision.

    Examples
    --------
    >>> a = np.array([[1, 2], [3, 4]])
    >>> np.mean(a)
    2.5
    >>> np.mean(a, axis=0)
    array([2., 3.])
    >>> np.mean(a, axis=1)
    array([1.5, 3.5])

    In single precision, `mean` can be inaccurate:

    >>> a = np.zeros((2, 512*512), dtype=np.float32)
    >>> a[0, :] = 1.0
    >>> a[1, :] = 0.1
    >>> np.mean(a)
    0.54999924

    Computing the mean in float64 is more accurate:

    >>> np.mean(a, dtype=np.float64)
    0.55000000074505806 # may vary

    Specifying a where argument:

    >>> a = np.array([[5, 9, 13], [14, 10, 12], [11, 15, 19]])
    >>> np.mean(a)
    12.0
    >>> np.mean(a, where=[[True], [False], [False]])
    9.0
help(stats.mean)
Help on function mean in module statistics:

mean(data)
    Return the sample arithmetic mean of data.

    >>> mean([1, 2, 3, 4, 4])
    2.8

    >>> from fractions import Fraction as F
    >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
    Fraction(13, 21)

    >>> from decimal import Decimal as D
    >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
    Decimal('0.5625')

    If ``data`` is empty, StatisticsError will be raised.
If you are using both libraries, Python will only know which mean function you are referring to if you import them in the way shown above and use the prefix. A short prefix is also useful for you and for readers of your code, as it makes clear which library a function comes from.
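To see the effect of the prefix in action without installing anything, here is a small sketch with a different pair of libraries than the ones discussed above: math and cmath, both part of Python's standard library, each provide a function called sqrt. This swap is purely for illustration, and the variable names are hypothetical.

```python
import cmath
import math

# Same function name, two libraries, different behaviour
real_root = math.sqrt(4)        # real-valued square root
complex_root = cmath.sqrt(-4)   # complex-valued square root

print(real_root)
print(complex_root)

# math.sqrt rejects negative input, whereas cmath.sqrt accepts it
try:
    math.sqrt(-4)
except ValueError as err:
    print("math.sqrt(-4) failed:", err)
```

Because each call carries its prefix, there is never any doubt about which sqrt is being used.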
Sometimes there are libraries where a naming conflict is very unlikely. One such library is the plotting library lets_plot, as it basically has only one (but very powerful, as you can see here) function, ggplot. In that case you can import that library as follows:
from lets_plot import *
This means that you can access the functions of that library directly, without any prefix. But to do that you have to be certain that importing this library creates no naming conflict. It should be used only as an exception.
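The risk can be illustrated with a small sketch that again uses math and cmath from the standard library (an assumption made so the example runs anywhere; lets_plot is not involved). It imports a single name without a prefix from each library. A star import behaves the same way, only for every public name of the library at once.

```python
from math import sqrt
first = sqrt              # keep a reference to math's version

from cmath import sqrt    # silently replaces the name sqrt
second = sqrt

# The name sqrt now refers to the cmath version only
print(first.__module__, second.__module__)
print(sqrt(4))
```

Python raises no warning when the second import overwrites the first, which is why imports without a prefix should stay the exception.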
Summary¶
Using libraries and packages is an essential skill for any Python user. Most of the functionality you will use as a data analyst, data scientist, or econometrician comes from Python's extensive library ecosystem. The majority of packages can be installed from PyPI, Python’s default package repository, using the pip package manager. Occasionally, you may need to install a package from GitHub or another source, in which case specific installation instructions are usually provided.
By mastering how to install, load, and explore Python libraries, you unlock the full potential of Python's capabilities for your projects. Whether you're performing data manipulation, building machine learning models, or visualizing complex datasets, Python’s libraries are your most powerful tools.