Things to Do With .ipynb Files

Although Jupyter Notebook is designed primarily for data analysts, I think there are other use cases in which it can be an effective method of distributing Python code and associated documentation in a single .ipynb file. There are other editors we can use to handle .ipynb files: JupyterLab, Deepnote and Visual Studio Code (with the right extensions) are just a few I've tried.

Jupyter Notebook is essentially a Markdown editor with an embedded Python interpreter that executes whatever code is included in a document. All the code within a document can be run as a single Python script, or the contents of individual code cells can be run on their own.

I got started by installing Anaconda Navigator. Anaconda is a Python distribution that bundles a range of data analysis applications, including Jupyter Notebook and JupyterLab, along with a Python interpreter and a large collection of modules that a data analyst might use; Navigator is the desktop launcher for them. It can also launch the Spyder IDE and the Qt Console.

To get started, run Anaconda Navigator, and in the options select 'Jupyter Notebook'. Jupyter starts a local Web server, and the application's interface is accessible in a browser at localhost:8888.

All the mathematical and arithmetic things are done the same way as with any Python-capable IDE, because that's essentially what's happening here. I haven't tried using specialist Python modules (e.g. networking, cryptography, etc.) with Jupyter yet, though.

There are functions that are particularly relevant for working with data sets. To find the max and min values in a set of variables:

min(myFirstVariable, mySecondVariable, myThirdVariable)
max(myFirstVariable, mySecondVariable, myThirdVariable)

And use this to find the range of the set:

largest = max(myFirstVariable, mySecondVariable, myThirdVariable)
smallest = min(myFirstVariable, mySecondVariable, myThirdVariable)
mySetRange = largest - smallest
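As a quick sanity check, here's the same range calculation with some made-up values (the numbers are purely illustrative):

```python
# Hypothetical sample values for the three variables
myFirstVariable = 2312
mySecondVariable = 1339
myThirdVariable = 9878

largest = max(myFirstVariable, mySecondVariable, myThirdVariable)
smallest = min(myFirstVariable, mySecondVariable, myThirdVariable)
mySetRange = largest - smallest  # 9878 - 1339 = 8539
print(mySetRange)
```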

However, a more convenient way of dealing with relatively small data sets might be to use Python lists:

dataSet = [2312, 1339, 9878, 4521]
average = sum(dataSet) / len(dataSet)

# Iterate over the values, e.g. to print each one
for i in dataSet:
    print(i)

If we wanted to find the median of the set, we might be better off using the median() function from the statistics module:

import statistics as stats

median = stats.median(dataSet)

Loading JSON Arrays

What if we really wanted to work with real-world data, and use Jupyter/Python to query data sources with numerous records? One method is to read it as JSON. The dictionary below has the same structure as a JSON object:

jsonData = { 'FirstVariable' : {'name': 'First Variable', 'value' : 498 },
            'SecondVariable' : {'name': 'Second Variable', 'value' : 677 }, 
            'ThirdVariable' : {'name': 'Third Variable', 'value' : 121 }}

To retrieve a specific named element, index into the nested dictionary:

jsonData['SecondVariable']['value']

We can also load JSON from an external file:

import json

with open("sample.json") as jsonFile:
    data = json.load(jsonFile)
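As a self-contained round trip (the dictionary here just reuses the jsonData example values), json.dump() writes the data out and json.load() reads it back:

```python
import json

jsonData = {'FirstVariable': {'name': 'First Variable', 'value': 498},
            'SecondVariable': {'name': 'Second Variable', 'value': 677},
            'ThirdVariable': {'name': 'Third Variable', 'value': 121}}

# Serialise the dictionary to a file...
with open("sample.json", "w") as jsonFile:
    json.dump(jsonData, jsonFile)

# ...and load it back into a new dictionary
with open("sample.json") as jsonFile:
    data = json.load(jsonFile)

print(data['SecondVariable']['value'])  # 677
```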

Importing Tables

Of course, Jupyter Notebook wouldn't be of much use if it could work only with a limited amount of data. Fortunately there are ways of importing and working with much larger data sets, using the pandas module. To read a spreadsheet file and display its contents in Jupyter:

from pandas import *
data = read_excel('MySpreadsheet.xls')

The table that appears is called a 'DataFrame'.
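A quick way to get a feel for a DataFrame is its shape attribute and head() method. The frame below is constructed in code so the sketch is self-contained, but the same calls work on the one returned by read_excel():

```python
import pandas as pd

# A small stand-in for the spreadsheet data
data = pd.DataFrame({'Value': [2343, 23653, 5312, 9884]})

print(data.shape)   # (4, 1): four rows, one column
print(data.head())  # the first five rows (all four, in this case)
```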

You'll probably want to work on data in a given column. The following code declares tbColumn as a pandas Series containing the values in the spreadsheet's Value column.

tbColumn = data['Value']

The output will be something like:

0     2343
1    23653
2     5312
3     9884
Name: Value, dtype: int64

So, tbColumn is a Series of int64 values. We can, of course, retrieve whatever values we want from this, in the usual way, e.g.

tbColumn[2]

There are a few handy things we can do with a Series in Python:

- tbColumn.sum()
- tbColumn.min()
- tbColumn.max()
- tbColumn.mean()
- tbColumn.median()
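A minimal sketch of those methods, using the same four values as the column output above:

```python
import pandas as pd

tbColumn = pd.Series([2343, 23653, 5312, 9884], name='Value')

print(tbColumn.sum())     # 41192
print(tbColumn.min())     # 2343
print(tbColumn.max())     # 23653
print(tbColumn.mean())    # 10298.0
print(tbColumn.median())  # 7598.0
```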

We also have a table sorting function, sort_values(), so we can sort rows by the values in a given column:

data.sort_values(by='Value')

CSV Data Sources and Matplotlib - A Real-World Example

I've used pandas to read and query real-world data provided by Our World in Data as .csv files - the ones used here were downloaded on 26th November. The first data set is 'UK: Daily new confirmed COVID-19 cases per 100,000'.

from pandas import *
csvdata = read_csv('covid-cases.csv')

As expected, there was a large number of records, covering each region of Britain. I wanted just the stats for Wales in 2021. It was at this point that I discovered pandas' query() method, which enables us to use SQL-like syntax for this.

csvdata = read_csv('covid-cases.csv')
filteredData = csvdata.query("(Entity=='Wales') and (Day > '2021-01-01')")
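The query() syntax can be tried on a small hand-built frame; the column names here simply mirror the Our World in Data ones, and the numbers are invented:

```python
import pandas as pd

csvdata = pd.DataFrame({
    'Entity': ['Wales', 'Wales', 'Scotland'],
    'Day': ['2020-12-30', '2021-02-01', '2021-02-01'],
    'Cases': [45, 30, 28],
})

# The string comparison on Day works because the dates are ISO-formatted,
# so lexicographic order matches chronological order
filteredData = csvdata.query("(Entity=='Wales') and (Day > '2021-01-01')")
print(filteredData)  # one row: Wales, 2021-02-01
```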

For some very basic data visualisation, I used Matplotlib, setting the two columns in my filtered data as the x and y axes.

import matplotlib.pyplot as plt

x = filteredData.Day
y = filteredData.daily_cases_rate_rolling_average

plt.plot(x, y)
plt.show()

On the Beach

The same could be done with the second data set, which is for 'UK: Number of COVID-19 patients in hospital'. It's important to be careful when naming the variables here, to avoid accidentally reading, querying or rendering the previous data set. (Note: I didn't filter for 2021 data in this second example.)

import matplotlib.pyplot as plt

csvpatients = read_csv('covid-hospital.csv')

filteredHospitalData = csvpatients.query("(Entity=='Wales') and (Day > '2020-01-01')")

xh = filteredHospitalData.Day
yh = filteredHospitalData.people_in_hospital

plt.plot(xh, yh)
plt.show()

The .ipynb and .csv files used here can be downloaded from my GitHub repo.