class: title-slide, middle, center

# Programming with Python (2)

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)
### Dept. of Medicine & Life Sciences
### Universitat Pompeu Fabra
<br>
## Fundamentals of Computational Biology
### BSc on Human Biology
### UPF School of Medicine and Life Sciences
### Academic Year 2024-2025

---
class: center, middle, inverse

# Vectors and their range of valid positions

---

## Vectors

* A vector, or [array](https://en.wikipedia.org/wiki/Array_data_structure) is a type of object (variable) that can store **more than one single value** and allows for an indexed access to its values.

--

* We can set a literal vector into a Python variable using an assignment:
<pre>
v = [1, 2, 3, 4, 5]
</pre>

--

* We can access its values by referring to one of its **valid positions** in the vector.

--

* Given a vector with `\(n\)` elements, the **valid positions** of a vector in Python (and in many other programming languages) go from 0 to `\(n-1\)`, that is, positions follow a [zero-based numbering](https://en.wikipedia.org/wiki/Zero-based_numbering).

![:scale 40%](data:image/png;base64,#img/dozeneggs.jpg)

.footer[
Image adapted from [Comp 101 Arrays: Overview](https://comp101.org/topics/arrays/arrays-overview).
]

---

## Vectors

* Given the vector:
<pre>
v = [1, 2, 3, 4, 5]
</pre>

* To access the values of `v` using one of its **valid positions**, we will use the notation `v[i]`, where `i` is a **valid position**, for instance:
<pre>
print(v[0])
v[0]+v[1]
i=4
v[i]
v[i-1]
</pre>

* Vectors are containers for sequences of values contiguous in memory.

* Filling up vectors with values enables re-using algorithms that operate on the vector space. This requires looping over valid positions of a vector.
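---

## Vectors

* As a minimal sketch of the range of valid positions, the code below reads the first and last positions of `v` and then tries a position just outside the range. The names `first`, `last` and `out_of_range` are illustrative only, and the `try`/`except` construct is used here just to catch the error that Python raises:

```python
v = [1, 2, 3, 4, 5]
n = len(v)            # number of elements in the vector, here 5

first = v[0]          # first valid position
last = v[n - 1]       # last valid position

out_of_range = False
try:
    v[n]              # position n lies outside the valid range 0 .. n-1
except IndexError:
    out_of_range = True

print(first, last, out_of_range)
```

* Accessing position `n` fails because, with zero-based numbering, the last valid position is `n-1`.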
---

## Looping over valid positions

* Using an iterative statement, we generate a sequence of non-negative integers that lets us loop over the **valid positions** of the vector and, consequently, access their associated values.

.left-column[
<pre>
v = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
i = 0
s = 0
while (i < 10) :
    s = s + v[i]
    i = i + 1

print(s)
</pre>
]
.right-column[
![](data:image/png;base64,#img/sumvector1to10.png)
]

---

## Looping over valid positions

* Assume we have a DNA sequence in the vector `v` and we want to count how many `T` nucleotides it contains.
<pre>
v = ['A', 'T', 'T', 'G', 'C', 'C', 'T', 'A']
i = 0
n = 0
while (i < 8) :
    if (v[i] == 'T') :
        n = n + 1
    i = i + 1

print(f"there are {n} nucleotides T")
</pre>

* Note that, in Python, string literals can be enclosed either by single quotes, e.g., `'T'`, or by double quotes, e.g., `"there are ..."`. Both forms produce an object of the same type (`str`), and a single character is simply a string of length one.

* The argument in the call to `print()` is a [formatted string literal](https://docs.python.org/3/tutorial/inputoutput.html#tut-f-strings), which lets us include the value of variables inside a string.

---
class: small-table

## Compound conditionals

* Let's say we want to count the number of dinucleotides `TT` in the DNA sequence. This requires comparing two consecutive positions in the vector, both of which should have the nucleotide `T`.

--

* We can implement such logic by nesting [conditional statements](https://en.wikipedia.org/wiki/Conditional_%28computer_programming%29). However, we can write more compact code with compound conditionals using [logical operators](https://en.wikipedia.org/wiki/Logical_connective).
--

* In Python we have the following three logical operators:

| Operator              | Type        | Description                          |
|-----------------------|-------------|--------------------------------------|
| _cond1_ `and` _cond2_ | conjunction | True if both operands are true       |
| _cond1_ `or` _cond2_  | disjunction | True if at least one operand is true |
| `not` _cond_          | negation    | True if the operand is false         |

* Here _cond_, _cond1_ and _cond2_ refer to logical conditions such as:
<pre>
i < 10
v[i] == 'T'
</pre>

---

## Compound conditionals

* We can implement the Python program that counts dinucleotides `TT` as follows:
<pre>
v = ['A', 'T', 'T', 'G', 'C', 'C', 'T', 'A']
i = 1
n = 0
while (i < len(v)) :
    if (v[i] == 'T' and v[i-1] == 'T') :
        n = n + 1
    i = i + 1

print(f"there are {n} dinucleotides TT")
</pre>
where the function `len(v)` returns the number of elements of the vector `v`.

--

* The previous implementation checks whether a nucleotide is identical to the previous one in the vector. Think about what would have to change in the code for it to look up the next position instead of the previous one.

---

## Concluding remarks (vectors)

* Vectors allow one to store multiple values in a single variable.

* Values in a vector are accessed by their position in the vector.

* **Valid positions** in a vector in Python start at 0 and consequently end at the number of elements minus one.

* Looping over **valid positions** enables developing algorithms that can be re-used by replacing the values of the vector.

* When accessing multiple positions in a vector simultaneously (e.g., for comparison), such as `v[i+1]`, `v[i+2]`, etc., care must be taken to avoid accessing positions outside the valid range.

* Vectors in programming are analogous to vectors in [mathematics and physics](https://en.wikipedia.org/wiki/Vector_%28mathematics_and_physics%29).
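---

## Accessing multiple positions

* The following sketch illustrates why the starting position matters when comparing each position with the previous one. It relies on a Python feature not covered above: negative positions are accepted and count from the end of the list, so `v[-1]` is the last element. The variable names `wrong` and `right` are illustrative only:

```python
v = ['T', 'A', 'C', 'T']    # no two consecutive T's in this sequence

# Starting at 0: v[i-1] becomes v[-1], which is the last element,
# so the first and last T are wrongly counted as a dinucleotide
i = 0
wrong = 0
while i < len(v):
    if v[i] == 'T' and v[i-1] == 'T':
        wrong = wrong + 1
    i = i + 1

# Starting at 1: i-1 always stays within the valid range 0 .. len(v)-1
i = 1
right = 0
while i < len(v):
    if v[i] == 'T' and v[i-1] == 'T':
        right = right + 1
    i = i + 1

print(wrong, right)
```

* The first loop reports one dinucleotide `TT` although the sequence has none; no error is raised, which makes this kind of mistake easy to miss.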
---
class: center, middle, inverse

# Built-in data types and object classes

---

## Python built-in data types

* What we have called _vectors_ so far are technically called _lists_ in Python.

* A _list_ in Python may contain values of different types:
<pre>
v = [4, 3.2, "Hello World!", True]
</pre>

* A _list_ in Python is one of the _built-in_ data types, concretely one of those classified as _sequence data types_:
  * `list`
<pre>
v = [4, 3.2, 'Hello World!', True] ## mutable, can change
</pre>
  * `tuple`: (4, 3.2, 'Hello World!', True)
<pre>
v = (4, 3.2, 'Hello World!', True) ## immutable, cannot change
</pre>
  * `range`: range(5)
<pre>
v = range(5) ## sequence of integer numbers from 0 to 4 (5 is excluded)
</pre>

---

## Python built-in data types

* The main Python [built-in data types](https://docs.python.org/3/library/stdtypes.html) are:
  * Text sequence type: `str`
  * Numeric types: `int`, `float`, `complex`
  * Sequence types: `list`, `tuple`, `range`
  * Mapping type: `dict`
  * Set types: `set`, `frozenset`
  * Boolean type: `bool`
  * Binary types: `bytes`, `bytearray`, `memoryview`

* We can figure out the data type of a Python object using the function `type()`:
<pre>
>>> type("Hello World!")
<class 'str'>
>>> type(4)
<class 'int'>
>>> type(3.2)
<class 'float'>
>>> type(True)
<class 'bool'>
</pre>

---

## Extending data types through object classes

* You can extend the available data types by using so-called [object classes](https://docs.python.org/3/tutorial/classes.html).
* Let's define a new _point_ data type:
<pre>
<font style="color: darkblue; font-weight: bold">class</font> Point:
    <font style="color: darkblue; font-weight: bold">def</font> __init__(self, x, y):
        self.x = x
        self.y = y
    <font style="color: darkblue; font-weight: bold">def</font> __repr__(self):
        <font style="color: darkblue; font-weight: bold">return</font> f"({self.x:.1f}, {self.y:.1f})"
</pre>

* We can now use this new _point_ data type in our Python code:
<pre>
>>> pt = Point(3, 4)
>>> type(pt)
<class '__main__.Point'>
>>> pt
(3.0, 4.0)
</pre>

* This is part of the so-called [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming) paradigm based on the concepts of [abstraction](https://en.wikipedia.org/wiki/Abstraction_%28computer_science%29), [encapsulation](https://en.wikipedia.org/wiki/Encapsulation_%28computer_programming%29), [inheritance](https://en.wikipedia.org/wiki/Inheritance_%28object-oriented_programming%29) and [polymorphism](https://en.wikipedia.org/wiki/Polymorphism_%28computer_science%29).

---
class: center, middle, inverse

# Functions

---

## Bundling lines together into functions

* Programming instructions performing a specific task, such as the calculation of a particular value or decision, can be bundled together under a so-called [function](https://en.wikipedia.org/wiki/Subroutine).

* Functions may take input arguments and may return output values.
* A [Python function](https://docs.python.org/3/tutorial/controlflow.html#defining-functions) is defined as follows:
<pre>
<font style="color: darkblue; font-weight: bold">def</font> sum(a, b):
    c=a+b
    <font style="color: darkblue; font-weight: bold">return</font> c
</pre>
which we can then call as follows (here, from the Python interpreter):
<pre>
>>> sum(3, 4)
7
</pre>

---

## Bundling lines together into functions

* We can write functions taking any class of object as parameters. For instance, using the previously defined _Point_ class, we can calculate the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between two points:
<pre>
<font style="color: darkblue; font-weight: bold">def</font> edist(pt1, pt2):
    dx=pt1.x-pt2.x ## distance of 'x' between pt1 and pt2
    dy=pt1.y-pt2.y ## distance of 'y' between pt1 and pt2
    ## the Euclidean distance between pt1 and pt2 is
    ## the square root of the sum of squares of the distances
    ed = (dx**2+dy**2)**0.5
    <font style="color: darkblue; font-weight: bold">return</font> ed
</pre>
which we can then call as follows:
<pre>
>>> pt2 = Point(5, 7)
>>> edist(pt, pt2)
3.605551275463989
</pre>

---

## Bundling lines together into functions

* Functions can be more complex. Let's write a function that calculates the arithmetic mean of the numerical values stored in a vector:
<pre>
<font style="color: darkblue; font-weight: bold">def</font> mean(v):
    s = 0
    i = 0
    while i < len(v):
        s = s + v[i]
        i = i + 1
    mean = s / len(v)
    <font style="color: darkblue; font-weight: bold">return</font> mean
</pre>
<pre>
>>> mean([1,2,3])
2.0
</pre>

---

## Bundling lines together into functions

* Functions can even be [recursive](https://en.wikipedia.org/wiki/Recursion_%28computer_science%29), that is, they can call themselves. For instance, to compute the `\(n\)`-th number of the [Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_number):
<pre>
<font style="color: darkblue; font-weight: bold">def</font> Fibonacci(n):
    if n == 0:
        <font style="color: darkblue; font-weight: bold">return</font> 0
    elif n == 1 or n == 2:
        <font style="color: darkblue; font-weight: bold">return</font> 1
    else:
        result = Fibonacci(n-1) + Fibonacci(n-2)
        <font style="color: darkblue; font-weight: bold">return</font> result
</pre>
<pre>
>>> Fibonacci(9)
34
</pre>

* Here we used the Python keyword `elif` to stack more than one alternative condition to the `if` statement.

---
class: center, middle, inverse

# Modules

---

## Organizing code into modules

* A [Python module](https://docs.python.org/3/tutorial/modules.html) is a `.py` file containing code to be reused in other Python files.

* For instance, let's say we store in a file called `point.py` the previous code:
<pre>
<font style="color: darkblue; font-weight: bold">class</font> Point:
    <font style="color: darkblue; font-weight: bold">def</font> __init__(self, x, y):
        self.x = x
        self.y = y
    <font style="color: darkblue; font-weight: bold">def</font> __repr__(self):
        <font style="color: darkblue; font-weight: bold">return</font> f"({self.x:.1f}, {self.y:.1f})"

<font style="color: darkblue; font-weight: bold">def</font> edist(pt1, pt2):
    dx=pt1.x-pt2.x ## distance of 'x' between pt1 and pt2
    dy=pt1.y-pt2.y ## distance of 'y' between pt1 and pt2
    ## the Euclidean distance between pt1 and pt2 is
    ## the square root of the sum of squares of the distances
    ed = (dx**2+dy**2)**0.5
    <font style="color: darkblue; font-weight: bold">return</font> ed
</pre>

---

## Organizing code into modules

* The previous file `point.py` can be reused as a _module_ as follows:
<pre>
>>> import point as pt
>>> pt1 = pt.Point(1.7, 2.8)
>>> pt2 = pt.Point(3.2, 2.1)
>>> pt.edist(pt1, pt2)
1.6552945357246849
</pre>

* The `import` statement is loading the code of the `point.py` file and storing it under the prefix `pt`.
* Modules can reuse other modules and, while it is not mandatory, it is considered good practice to place all the `import` statements at the beginning of a file.

* It is also possible to bypass the use of a prefix using `from` _module_ `import *`:
<pre>
>>> from point import *
>>> pt = Point(1.7, 2.8)
>>> pt
(1.7, 2.8)
</pre>

---

## Using modules

* You can import a module using any of the following forms:
<pre>
>>> import <module_name>
>>> import <module_name> as <alias>
>>> from <module_name> import <entity_name>
</pre>

* For example, three ways to use the `sqrt` function from the `math` module:
<pre>
>>> import math
>>> math.sqrt(9)
3.0
</pre>
<pre>
>>> import math as m
>>> m.sqrt(9)
3.0
</pre>
<pre>
>>> from math import sqrt
>>> sqrt(9)
3.0
</pre>

---

## Installing modules as packages

* Modules that provide classes and functions to be reused, but that are not supposed to be run as standalone applications, are called _libraries_ or [_packages_](https://docs.python.org/3/tutorial/modules.html#packages).

* Commonly used Python packages:
  * `matplotlib`: for plotting.
  * `math`: for mathematical operations (installed by default).
  * `numpy`: scientific computation.
  * `sys`: system-specific parameters and functions (installed by default).
  * `pandas`: data analysis and manipulation.

* Many packages are part of the so-called [Python standard library](https://docs.python.org/3/library/index.html), installed by default. However, many others need to be _installed_ before they can be _used_. You can browse them at the [Python Package Index](https://pypi.org).

* There are different ways to install Python packages; one of them is using the `pip`, or `pip3`, command.
For instance, this is how we would install the `numpy` and `matplotlib` packages:
<pre>
$ pip install numpy
$ pip install matplotlib
</pre>

---

## Using modules as packages

* Once installed, this is how we could use the packages `numpy` and `matplotlib`:
<pre>
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--')
plt.show() ## display the figure when run as a script
</pre>

![:scale 70%](data:image/png;base64,#img/sin_cos.jpg)

---

## Concluding remarks (classes, functions, modules)

* We can make our Python code more modular by defining **functions**.

* Python built-in data types can be extended by defining new **classes** of objects.

* Python code developed for a particular purpose can be bundled together into **modules**.

* Modules that provide classes and functions to be reused, but that are not supposed to be run as standalone applications, are called **packages** or **libraries**.

---
class: center, middle, inverse

# Non-interactive execution

---

## Taking arguments from the Unix command line

* From the prefix we can find out the module's name by querying its `__name__` variable as follows (e.g., with our previous `point.py` module):
<pre>
>>> pt.__name__
'point'
</pre>

* When a file is instead run directly as a program, rather than imported, Python sets `__name__` to `"__main__"`. So if we add the following code to the **end** of our module `point.py`, it runs only when the file is executed as a program:
<pre>
if __name__ == "__main__" :
    import sys ## for reading command-line arguments

    pt1 = Point(float(sys.argv[1]), float(sys.argv[2]))
    pt2 = Point(float(sys.argv[3]), float(sys.argv[4]))
    ed = edist(pt1, pt2)
    print(ed)
</pre>

* We can run the previous Python module from the Unix command line as follows:
<pre>
$ python point.py 1.7 2.8 3.2 2.1
1.6552945357246849
</pre>

---

## Concluding remarks (non-interactive execution)

* Being able to execute Python programs non-interactively allows one to automate workflows involving the execution of multiple programs.

* Using the `sys` module and the Python variable `__name__`, we can enable our Python programs to take arguments from the command line.
* The `sys` module stores the arguments given on the command line, as strings, in a list called `sys.argv`, where `sys.argv[0]` holds the name of the program.
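---

## Taking arguments from the Unix command line

* As a hedged sketch of how the contents of `sys.argv` could be checked before use, the hypothetical function `parse_coordinates()` below (not part of `point.py`) verifies the number of arguments and converts them to numbers. The list `example_argv` is an assumption that simulates what `sys.argv` would hold for the call `python point.py 1.7 2.8 3.2 2.1`:

```python
def parse_coordinates(argv):
    """Convert the four arguments after the program name into floats."""
    if len(argv) != 5:    # program name plus four coordinates
        raise ValueError("usage: python point.py x1 y1 x2 y2")
    coords = []
    i = 1                 # position 0 holds the program name
    while i < len(argv):
        coords.append(float(argv[i]))   # arguments arrive as strings
        i = i + 1
    return coords

# simulated content of sys.argv (assumption, for illustration only)
example_argv = ["point.py", "1.7", "2.8", "3.2", "2.1"]
print(parse_coordinates(example_argv))
```

* In a real program we would pass `sys.argv` instead of `example_argv`; checking the number of arguments first avoids accessing positions outside the valid range of the list.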