Programming with Python (2)

class: title-slide, middle, center

# Programming with Python (2)

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)
### Dept. of Medicine & Life Sciences
### Universitat Pompeu Fabra

## Fundamentals of Computational Biology
### BSc on Human Biology
### UPF School of Medicine and Life Sciences
### Academic Year 2024-2025

---
class: center, middle, inverse

# Vectors and their range of valid positions
---

## Vectors

* A vector, or [array](https://en.wikipedia.org/wiki/Array_data_structure) is a
  type of object (variable) that can store **more than one single value** and
  allows for an indexed access to its values.  
  &nbsp;&nbsp;
--

* We can set a literal vector into a Python variable using an assignment:
 <pre>
 v = [1, 2, 3, 4, 5]
 </pre>
--

* We can access its values by referring to one of its **valid positions** in
  the vector.  
  &nbsp;&nbsp;
--

* Given a vector with `$n$` elements, the **valid positions** of a vector in
  Python (and in many other programming languages) go from 0 to `$n-1$`, that is,
  positions follow a
  [zero-based numbering](https://en.wikipedia.org/wiki/Zero-based_numbering).
![:scale 40%](data:image/png;base64,#img/dozeneggs.jpg)

.footer[
Image adapted from [Comp 101 Arrays: Overview](https://comp101.org/topics/arrays/arrays-overview).
]

---

## Vectors

* Given the vector:
 <pre>
 v = [1, 2, 3, 4, 5]
 </pre>
* To access the values of `v` using one of its **valid positions**, we will
 use the notation `v[i]`, where `i` is a **valid position**, for instance:
 <pre>
 print(v[0])
 v[0]+v[1]
 i=4
 v[i]
 v[i-1]
 </pre>
* Vectors are containers for sequences of values contiguous in memory. 
 &nbsp;&nbsp;
* Filling up vectors with values enables re-using algorithms that operate
 on the vector space. This requires looping over valid positions of a vector.

---

## Looping over valid positions

* Using an iterative statement, we generate a sequence of non-negative integer
  numbers that enable looping over the **valid positions** of the vector, and
  consequently the access to their associated values.

.left-column[
 <pre>
 v = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
 i = 0
 s = 0

while (i < 10) :
 &nbsp;&nbsp;s = s + v[i]
 &nbsp;&nbsp;i = i + 1

print(s)
 </pre>
]

.right-column[
  &nbsp;&nbsp;  
  &nbsp;&nbsp;
  ![](data:image/png;base64,#img/sumvector1to10.png)
]

---

## Looping over valid positions

* Assume we have in the vector `v` a DNA sequence and we want to count
 how many nucleotides `T` we have in this sequence.
<pre>
v = ['A', 'T', 'T', 'G', 'C', 'C', 'T', 'A']
i = 0
n = 0
&nbsp;&nbsp;
while (i < 8) :
&nbsp;&nbsp;if (v[i] == 'T') :
&nbsp;&nbsp;&nbsp;&nbsp;n = n + 1
&nbsp;&nbsp;i = i + 1
&nbsp;&nbsp;
print(f"there are {n} nucleotides T")
</pre>
* Note that, in Python, character literals are enclosed by single quotes,
 i.e., `'T'`, while character string literals are enclosed by double
 quotes `"there are ..."`. 
 &nbsp;&nbsp;
* The argument in the call to `print()` is a
 [formatted string literal](https://docs.python.org/3/tutorial/inputoutput.html#tut-f-strings),
 which let us include the value of variables inside a string.

---

class: small-table

## Compound conditionals

* Let's say we want to count the number of dinucleotides `TT` in the
  DNA sequence. This requires comparing two consecutive positions in the
  vector, both of which should have the nucleotide `T`.  
  &nbsp;&nbsp;
--

* We can implement such a logic by nesting
  [conditional statements](https://en.wikipedia.org/wiki/Conditional_%28computer_programming%29).
  However, we can write more compact code with compound conditionals using
  [logical operators](https://en.wikipedia.org/wiki/Logical_connective).  
  &nbsp;&nbsp;

--
* In Python we have the following three logical operators:

| Operator | Type | Description |
|-------------------- | -------------- | ------------------------------|
| _cond1_ `and` _cond2_ | conjunction | True if both operands are true|
| _cond1_ `or` _cond2_ | disjunction | True if either operand is true|
| `not` _cond_ | negation | True if operand is false |
* Here _cond_, _cond1_ and _cond2_ refer to logical conditions such as:
 <pre>
 i < 10
 v[i] == 'T'
 </pre>

---

## Compound conditionals

* We can implement the Python program that counts dinucleotides `TT` as follows:
<pre>
v = ['A', 'T', 'T', 'G', 'C', 'C', 'T', 'A']
i = 1
n = 0
&nbsp;
while (i < len(v)) :
&nbsp;&nbsp;if (v[i] == 'T' and v[i-1] == 'T') :
&nbsp;&nbsp;&nbsp;&nbsp;n = n + 1
&nbsp;&nbsp;i = i + 1
&nbsp;
print(f"there are {n} dinucleotides TT")
</pre>
where the function `len(v)` returns the number of elements of the vector `v`. 
&nbsp;&nbsp;
--

* The previous implementation looks up whether a nucleotide is identical
  to the previous one in the vector, think about what would be change in
  the code to work looking up the next position, instead of the previous
  one.

---

## Concluding remarks (vectors)

* Vectors allow one to store multiple values in a single variable.  
  &nbsp;&nbsp;
* Values in a vector are accessed by their position in the vector.  
  &nbsp;&nbsp;
* **Valid positions** in a vector in Python start at 0 and consequently end
  at the number of elements minus one.  
  &nbsp;&nbsp;
* Looping over **valid positions** enables developing algorithms that
  can be re-used by replacing the values of the vector.  
  &nbsp;&nbsp;
* When accessing simultaneously (e.g., for comparison) multiple positions
  in a vector, i.e., `v[i+1], v[i+2], `etc., care must be taken to avoid
  accessing positions outside the valid range.  
  &nbsp;&nbsp;
* Vectors in programming are analogous to vectors in
  [mathematics and physics](https://en.wikipedia.org/wiki/Vector_%28mathematics_and_physics%29).

---
class: center, middle, inverse

# Built-in data types and object classes

---

## Python built-in data types

* What we have called _vectors_ so far are technically called _lists_ in
 Python. 
 &nbsp;&nbsp;
* A _list_ in Python may contain values of different types:
 <pre>
 v = [4, 3.2, "Hello World!", True]
 </pre>
* A _list_ in Python is one of the _built-in_ data types, concretely those
 that can be classified as _sequence data types_:
 * `list`
 <pre>
 v = [4, 3.2, 'Hello World!', True] ## mutable, can change
 </pre>
 * `tuple`: (4, 3.2, 'Hello World!', True)
 <pre>
 v = (4, 3.2, 'Hello World!', True) ## immutable, cannot change
 </pre>
 * `range`: range(5)
 <pre>
 v = range(5) ## sequence of integer numbers from 0 to 5
 </pre>

---

## Python built-in data types

* The whole collection of Python
[built-in data types](https://docs.python.org/3/library/stdtypes.html) is:
 * Text sequence type: `str`
 * Numeric types: `int`, `float`, `complex`
 * Sequence types: `list`, `tuple`, `range`
 * Mapping type: `dict`
 * Set types: `set`, `frozenset`
 * Boolean type: `bool`
 * Binary types: `bytes`, `bytearray`, `memoryview`
* We can figure out the data type of a Python object using the function
 `type()`:
 <pre>
 >>> type("Hello World!")
 &lt;class 'str'&gt;
 >>> type(4)
 &lt;class 'int'&gt;
 >>> type(3.2)
 &lt;class 'float'&gt;
 >>> type(True)
 &lt;class 'bool'&gt;
 </pre>

---

## Extending data types through object classes

* You can extend the available data types by using so-called
 [object classes](https://docs.python.org/3/tutorial/classes.html).
 &nbsp;&nbsp;
* Let's define a new _point_ data type:
 <pre>
 class Point:
 &nbsp;&nbsp;def __init__(self, x, y):
 &nbsp;&nbsp;self.x = x
 &nbsp;&nbsp;self.y = y
 &nbsp;&nbsp;def __repr__(self):
 &nbsp;&nbsp;return f"({self.x:.1f}, {self.y:.1f})"
 </pre>
* We can now use this new _point_ data type in our Python code:
 <pre>
 >>> pt = Point(3, 4)
 >>> type(pt)
 &lt;class '__main__.Point'&gt;
 >>> pt
 (3.0, 4.0)
 </pre>
* This is part of the so-called
 [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming)
 paradigm based on the concepts of
 [abstraction](https://en.wikipedia.org/wiki/Abstraction_%28computer_science%29),
 [encapsulation](https://en.wikipedia.org/wiki/Encapsulation_%28computer_programming%29),
 [inheritance](https://en.wikipedia.org/wiki/Inheritance_%28object-oriented_programming%29)
 and
 [polymorphism](https://en.wikipedia.org/wiki/Polymorphism_%28computer_science%29).

---
class: center, middle, inverse

# Functions

---

## Bundling lines together into functions

* Programming instructions performing a specific task, such as the calculation
 of a particular value or decision, can be bundled together under a so-called
 [function](https://en.wikipedia.org/wiki/Subroutine). 
 &nbsp;&nbsp;
* Functions may take input arguments and may return output values. 
 &nbsp;&nbsp;
* A [Python function](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)
 is defined as follows:
 <pre>
 def sum(a, b):
 &nbsp;&nbsp;c=a+b
 &nbsp;&nbsp;return c
 </pre> 
 which we can call then as follows (example calling it from the Python interpreter):
 <pre>
 >>> sum(3, 4)
 7
 </pre>

---

## Bundling lines together into functions

* We can write functions taking any class of object as parameters, for instance,
 from the previously defined _Point_ class to calculate the
 [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between
 two points:
 <pre>
 def edist(pt1, pt2):
 &nbsp;&nbsp;dx=pt1.x-pt2.x ## distance of 'x' between pt1 and pt2
 &nbsp;&nbsp;dy=pt1.y-pt2.y ## distance of 'y' between pt1 and pt2
 &nbsp;&nbsp;## the Euclidean distance between pt1 and pt2 is
 &nbsp;&nbsp;## the square root of the sum of squares of the distances
 &nbsp;&nbsp;ed = (dx**2+dy**2)**0.5
 &nbsp;&nbsp;return ed
 </pre>
 which we can call then as follows:
 <pre>
 >>> pt2 = Point(5, 7)
 >>> edist(pt, pt2)
 3.605551275463989
 </pre>

---

## Bundling lines together into functions

* Functions can be more complex. Let's write a function that calculates the
 arithmetic mean of the numerical values stored in a vector:
 <pre>
 def mean(v):
 &nbsp; s = 0
 &nbsp; i = 0
 &nbsp; while i < len(v):
 &nbsp;&nbsp;&nbsp;&nbsp; s = s + v[i]
 &nbsp;&nbsp;&nbsp;&nbsp; i = i + 1
 &nbsp;
 &nbsp; mean = s / len(v)
 &nbsp; return mean
 </pre>
 &nbsp;&nbsp;
 <pre>
 >>> mean([1,2,3])
 2.0
 </pre>

---

## Bundling lines together into functions

* Functions can be even
 [recursive](https://en.wikipedia.org/wiki/Recursion_%28computer_science%29).
 For instance, to compute the sum of a
 [Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_number).
 <pre>
 def Fibonacci(n):
 &nbsp;&nbsp; if n == 0:
 &nbsp;&nbsp;&nbsp;&nbsp; return 0
 &nbsp;&nbsp; elif n == 1 or n == 2:
 &nbsp;&nbsp;&nbsp;&nbsp; return 1
 &nbsp;&nbsp; else:
 &nbsp;&nbsp;&nbsp;&nbsp; result = Fibonacci(n-1) + Fibonacci(n-2)
 &nbsp;&nbsp;&nbsp;&nbsp; return result
 </pre>
 <pre>
 >>> Fibonacci(9)
 34
 </pre>
* Here we used the Python keyword `elif` to stack more than one alternative
 condition to the `if` statement.

---
class: center, middle, inverse

# Modules

---

## Organizing code into modules

* A [Python module](https://docs.python.org/3/tutorial/modules.html) is a `.py`
 file containing code to be reused in other Python files. 
 &nbsp;&nbsp;
* For instance, let's say we store in a file called `point.py` the previous code:
 <pre>
 class Point:
 &nbsp;&nbsp;def __init__(self, x, y):
 &nbsp;&nbsp;&nbsp;&nbsp;self.x = x
 &nbsp;&nbsp;&nbsp;&nbsp;self.y = y
 &nbsp;&nbsp;def __repr__(self):
 &nbsp;&nbsp;&nbsp;&nbsp;return f"({self.x:.1f}, {self.y:.1f})"
 &nbsp;
 def edist(pt1, pt2):
 &nbsp;&nbsp;dx=pt1.x-pt2.x ## distance of 'x' between pt1 and pt2
 &nbsp;&nbsp;dy=pt1.y-pt2.y ## distance of 'y' between pt1 and pt2
 &nbsp;&nbsp;## the Euclidean distance between pt1 and pt2 is
 &nbsp;&nbsp;## the square root of the sum of squares of the distances
 &nbsp;&nbsp;ed = (dx**2+dy**2)**0.5
 &nbsp;&nbsp;return ed
 </pre>

---

## Organizing code into modules

* The previous file `point.py` can be reused as a _module_ as follows:
 <pre>
 >>> import point as pt
 >>> pt1 = pt.Point(1.7, 2.8)
 >>> pt2 = pt.Point(3.2, 2.1)
 >>> pt.edist(pt1, pt2)
 1.6552945357246849
 </pre>
* The `import` statement is loading the code of the `point.py` file
and storing it under the prefix `pt`. 
 &nbsp;&nbsp;
* Modules can reuse other modules and while it is not mandatory, it is
 considered a good practice to place all the `import` statements at the
 beginning of a file. 
 &nbsp;&nbsp;
* It is also possible to bypass the use of a prefix using `from` _module_
 `import *`:
 <pre>
 >>> from point import *
 >>> pt = Point(1.7, 2.8)
 >>> pt
 (1.7, 2.8)
 </pre>

---

## Using modules

* You can import a module following any of the following syntax:
 <pre>
 >>> import &lt;module_name&gt;
 >>> import &lt;module_name&gt; as &lt;alias&gt;
 >>> from &lt;module_name&gt; import &lt;entity_name&gt;
 </pre>
* For example, three ways to use the `sqrt` function from the `math` package:
 <pre>
 >>> import math
 >>> math.sqrt(3)
 3.0
 </pre>
 <pre>
 >>> import math as m
 >>> m.sqrt(3)
 3.0
 </pre>
 <pre>
 >>> from math import sqrt
 >>> sqrt(3)
 3.0
 </pre>

---

## Installing modules as packages

* Modules that provide classes and functions to be reused, but that are not
 supposed to be run as standalone applications are called _libraries_ or
 [_packages_](https://docs.python.org/3/tutorial/modules.html#packages). 
 &nbsp;&nbsp;
* Commonly used Python packages:
 * `matplotlib`: for plotting.
 * `math`: for mathematical operations (installed by default).
 * `numpy`: scientific computation.
 * `sys`: system-specific parameters and functions (installed by default).
 * `pandas`: data analysis and manipulation. 
* Many packages are part of the so-called
 [Python standard library](https://docs.python.org/3/library/index.html),
 installed by default. However, many others may need to be _installed_
 before they can be _used_. You can browse them at the
 [Python Package Index](https://pypi.org). There are different ways to install
 Python packages, one of them is using the `pip`, or `pip3`, command. For
 instance, this is how we would install the `numpy` and `matplotlib` packages:
 <pre>
 $ pip install numpy
 $ pip install matplotlib
 </pre>

---

## Using modules as packages

* Once installed, this is how we could use the packages `numpy` and
 `matplotlib`:
<pre>
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');
</pre>

![:scale 70%](data:image/png;base64,#img/sin_cos.jpg)

---

## Concluding remarks (classes, functions, modules)

* We can make our Python code more modular by defining **functions**.  
  &nbsp;&nbsp;
* Python built-in data types can be extended by defining new **classes**
  of objects.  
  &nbsp;&nbsp;
* Python code developed for a particular purpose can be bundled
  together into **modules**.  
  &nbsp;&nbsp;
* Modules that provide classes and functions to be reused, but that are not
  supposed to be run as standalone applications are called 
  **packages** or **libraries**.

---
class: center, middle, inverse

# Non-interactive execution

---

## Taking arguments from the Unix command line

* From the prefix we can find out the module's name invoking the global
 variable `__name__` as follows (e.g., with our previous `point.py` module):
 <pre>
 >>> pt.__name__
 'point'
 </pre>
* If we add the following code to the **end** of our module `point.py`:
 <pre>
 if __name__ == "__main__" :
 &nbsp;&nbsp;import sys ## for reading command-line arguments
 &nbsp;&nbsp;pt1 = Point(float(sys.argv[1]), float(sys.argv[2]))
 &nbsp;&nbsp;pt2 = Point(float(sys.argv[3]), float(sys.argv[4]))
 &nbsp;&nbsp;ed = edist(pt1, pt2)
 &nbsp;&nbsp;print(ed)
 </pre>
* We can run the previous Python module from the Unix command line as follows:
 <pre>
 $ python point.py 1.7 2.8 3.2 2.1
1.65529453572
 </pre>

---

## Concluding remarks (non-interactive execution)

* Being able to execute Python programs non-interactively allows one
  to automatize workflows involving the execution of multiple programs.  
  &nbsp;&nbsp;
* Using the `sys` module and the Python variable `__name__`, we can
  enable our Python programs to take arguments from the command line.  
  &nbsp;&nbsp;
* The `sys` module puts into a vector called `sys.argv` the arguments
  given in the command line.