Python Data Structures and String Manipulation
For those entirely unfamiliar with Python, reading an introductory article will help make sense of this one, which will cover features, methods, and caveats of working with strings, integers, variables, lists, dictionaries, and more. Otherwise, make sure Python is installed and open the REPL to follow along.
The Print Function
Written as print(), this function is the simplest way to output information to the user through a terminal. That information can consist of various data types, each of which has its own rules for how it can be used:
print('string') # "string"
Strings
A string is a sequence of characters enclosed in quotes. Both single and double quotes are valid, but the opening and closing quotes must be the same type:
'This is a valid string'
"This is also valid"
"This is not valid'
'This is also not valid"
Alternating quotes can exist as long as the outside quote type is not present on the inside:
"This isn't going to return an error"
'"This fully quoted string is also valid"'
'"This isn't a valid string"'
Escape Characters
Strings can contain the same quote type both inside and outside if the inner quote character is escaped. Character escaping uses a special character (in Python, the backslash \) to alter the interpretation of the characters that follow it:
print('This isn\'t going to return an error')
print("\"This quoted string is escaped and also valid\"")
Certain whitespace characters, such as tab and newline, can be written with the escape sequences \t and \n respectively:
# "This string is tab separated"
print('This\tstring\tis\ttab\tseparated')
# Line 1
# Line 2
print('Line 1\nLine 2')
Printing a literal backslash requires escaping the escape character itself, or using a raw string, which treats backslashes literally and is written by prepending r or R to the string:
print('This is a backslash: \\')
print('\\n') # "\n"
print('\\t') # "\t"
print(r'This renders \ directly')
print(r'Newline = \n') # "Newline = \n"
print(R'Tab = \t') # "Tab = \t"
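The difference between a regular and a raw string can be checked by comparing lengths. The following is a small sketch using a made-up Windows-style path, where the names normal and raw are chosen purely for illustration:

```python
# The path below is a made-up example; only the backslashes matter here
normal = 'C:\new\table'   # \n and \t are read as newline and tab
raw = r'C:\new\table'     # backslashes are kept as written

print(len(normal))  # 10
print(len(raw))     # 12
print(raw == 'C:\\new\\table')  # True
```

The raw string is two characters longer because each backslash and the letter after it are stored as two separate characters rather than as one whitespace character.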
String Concatenation
String concatenation combines multiple strings into a single one. Adjacent string literals need no delimiter: they are joined in the order written, which is functionally equivalent to using + between them (this shortcut applies to literals only, not to variables). Separating arguments with , instead passes them to print() individually, and print() joins them with a separator, which is a space by default but can be changed with the sep keyword argument:
# All of these output "String1String2String3"
print('String1'"String2"'String3')
print('String1' 'String2' 'String3')
print('String1' + 'String2' + 'String3')
print(
'String1'
'String2'
'String3'
)
# "String1 String2 String3"
print('String1', 'String2', 'String3')
# "String1.String2.String3"
print('String1', 'String2', 'String3', sep=".")
# String1
# String2
# String3
print('String1', 'String2', 'String3', sep="\n")
The print() function also supports integers and floating point numbers, including expressions that perform operations between them (which will not be covered in depth here):
print(42.0) # 42.0
print(2 + 2) # 4
Notice that integers and floats are not quoted, as strings, integers, and floats are all different data types. Concatenating them requires abiding by their syntactic rules:
print(2 + 2) # 4
print('2' + '2') # "22"
print('2', 2) # "2 2"
print('2' + 2) # This returns an error due to differing data types
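One way around the error above is to convert one side explicitly so that both operands share a type. A minimal sketch, assuming the built-in str() and int() functions and illustrative variable names:

```python
count = 2

joined = 'Total: ' + str(count)   # number converted to a string, then concatenated
summed = int('2') + count         # string converted to an integer, then added

print(joined)  # Total: 2
print(summed)  # 4
```

Which side to convert depends on whether the intended result is text or a number.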
Variables
Variables are names used for storing and retrieving values of any data type; they are not a data type themselves. Their names are case sensitive, can include underscores, must otherwise be alphanumeric, and cannot start with a number:
var = 'value' # Assigns string "value" to the variable "var"
VAR = 0 # Full capitals should be reserved for constants
var_name = 1 # Snake Case naming convention
varName = 2 # Camel Case naming convention
VarName = 3 # Pascal Case naming convention
_ = 2 # Valid assignment
var1 = 1 # Valid assignment
1var = 0 # Invalid assignment
Print statements support the use of variables as well:
var = 2 + 2
num = var + 1
text = 'This string is being stored for later'
print(var) # 4
print(num) # 5
print(text) # "This string is being stored for later"
They can be reassigned, appended to, or deleted:
var = 'Test' # Assigns value
print(var) # "Test"
var += 'String' # Appends value
print(var) # "TestString"
var = 42 # Reassigns value
print(var) # 42
example = var = 1 # Performs multiple assignments
print(example) # 1
print(var) # 1
del var # Deletes variable
print(var) # Returns an error for undefined variable
Values can be assigned within an expression using := (informally called the walrus operator, available from Python 3.8):
print(var := 'Example') # "Example"
print(var) # "Example"
Note that deleting a variable is not the same as removing its value. Python's null value is None, which is distinct from both False and the empty string "":
var = None # Contains no value
var = False # Boolean, opposite of True
var = "" # Empty string
del var # Deletes variable entirely
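Because None, False, and "" all count as false in a truth test, telling them apart usually calls for an identity check. A short sketch with illustrative variable names:

```python
nothing = None
flag = False
blank = ''

# All three are "falsy", so a plain truth test cannot tell them apart
all_falsy = (not nothing) and (not flag) and (not blank)

# "is None" only matches None itself
only_none = (nothing is None, flag is None, blank is None)

print(all_falsy)  # True
print(only_none)  # (True, False, False)
```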
F-Strings
Formally called formatted string literals, f-strings enable string interpolation (among other features). String concatenation keeps each element as a separate piece, whereas string interpolation embeds expressions directly in the string inside {}. An f-string is written by prepending either f or F to the string:
# This outputs "The answer is C"
var = 'C'
print(f'The answer is {var}')
# This outputs "2 + 2 = 4"
print(F'2 + 2 = {2 + 2}')
Since {} marks a replacement field (an expression to evaluate) in f-strings, the braces must be doubled as {{ and }} to be written literally:
print('The {} characters do not require escaping in strings')
print(f'The {{}} characters require escaping in f-strings')
print(f'Empty non-escaped {} characters in f-strings return an error')
Floats can be formatted to limit the number of decimal places (note that Python rounds this example):
var = 3.141592; f'{var:.3f}' # 3.142
Elements can be padded with a specified fill character to a specified width. Note that the width is the total length of the result, so it includes the length of the element itself. This can be verified with the len() function, which returns the length of a given input:
var = 'Test'
len(var) # 4
f'{var:_<6}' # Adds 2 _ to the right of "Test"
f'{var:.>7}' # Adds 3 . to the left of "Test"
f'{var:-^12}' # Adds 4 - to each side of "Test"
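Padding is often used to line up columns of text. The sketch below omits the fill character, in which case a space is used, and the item names and quantities are invented for illustration:

```python
# Left-align the first column to 8 characters, right-align the second to 4
header = f'{"Item":<8}|{"Qty":>4}'
row_1 = f'{"apple":<8}|{3:>4}'
row_2 = f'{"kiwi":<8}|{12:>4}'

print(header)  # Item    | Qty
print(row_1)   # apple   |   3
print(row_2)   # kiwi    |  12
```

Every row comes out at the same total length (13 characters here), which is what keeps the | separators aligned.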
Docstrings
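Short for documentation strings, docstrings are string literals placed as the first statement of a module, function, or class to document it. They are conventionally written as triple-quoted strings, using either """ or ''', which can span multiple lines and do not require escaping internal quotes. Triple-quoted strings are also sometimes used in place of # comments, which only span a single line: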
print(
"""Demonstrates docstrings.
'Single quoted string'
"Double quoted string"
""")
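The \ character does not need to be escaped when the character after it does not form a recognised escape sequence (recent Python versions emit a warning for such unrecognised sequences, so \\ or a raw string is the safer choice). At the end of a line, it is interpreted as a line continuation that suppresses the newline: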
print(
'''Demonstrates escapes.
\ is rendered literally
Escaped backslash: \\
This sentence will render \
as a single line
''')
Note that expressions are rendered in a literal manner because the contents of a docstring are contained in a single string, but like regular strings, both f-strings and raw strings are supported:
print(f"""{2+2}""") # 4
print(r"""\t\n""") # "\t\n"
print("""\t\n""") # Renders tab and newline whitespaces
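When a triple-quoted string is the first statement in a function, Python stores it as that function's documentation. A minimal sketch, where greet is a hypothetical function written only for this example:

```python
def greet(name):
    """Return a short greeting for name."""
    return 'Hello, ' + name

doc = greet.__doc__   # the docstring is stored on the function itself

print(doc)           # Return a short greeting for name.
print(greet('Ada'))  # Hello, Ada
```

The same text is what the built-in help() function displays for the function.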
Tuples
Tuples are immutable collections of ordered elements indexed by sequential integers. Elements can be of any data type and are separated by commas. A tuple is written within a set of (), and its elements are retrieved by index with a set of []:
var = ('string', 42.0, 2 + 2, __name__, False)
len(var) # 5
print(var) # ('string', 42.0, 4, '__main__', False)
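Immutability means that an existing tuple cannot be changed in place. The following sketch attempts an item assignment and records the result in an illustrative variable named outcome:

```python
var = ('string', 42.0, False)

try:
    var[0] = 'changed'   # item assignment is not supported by tuples
    outcome = 'assigned'
except TypeError:
    outcome = 'TypeError'

print(outcome)  # TypeError
print(var)      # ('string', 42.0, False)
```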
Note that a tuple can consist of a single element, but the trailing , is still required for it to register as a tuple, which can be verified with the type() function:
var = ('input')
type(var) # <class 'str'>
var = ('input',)
type(var) # <class 'tuple'>
The count() method returns the number of times a given value appears in the tuple. Note that equal integers and floats are considered the same:
var = (3, 2, 1, 1.0, 0)
var.count(1) # 2
Indices start from 0, but elements can also be retrieved from the end using negative integers:
var = ('tuples', 'are', 'groups')
var[0] # "tuples"
var[1] # "are"
var[2] # "groups"
var[-1] # "groups"
var[-2] # "are"
var[-3] # "tuples"
Strings can be indexed in the same way as tuples, since each character has its own index:
'string'[0] # "s"
'string'[1] # "t"
'string'[2] # "r"
'string'[3] # "i"
'string'[4] # "n"
'string'[5] # "g"
The syntax [start:end:interval] (known as slicing) can be used to retrieve multiple indices. The start index is included and the end index is excluded:
# Outputs start/end range
'string'[0:3] # "str"
'string'[1:4] # "tri"
'string'[2:-1] # "rin"
# Outputs every 3rd element from index 1 up to (but not including) index 8
var = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
var[1:8:3] # (1, 4, 7)
Each part of a slice has a default value that applies when it is omitted:
var = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
var[:3] == var[0:3]
var[2:] == var[2:len(var)]
var[1:7:] == var[1:7:1]
# ∴
var[::] == var[0:len(var):1] == var
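The interval can also be negative, in which case the slice walks backwards and the start index should be the larger one. A brief sketch with illustrative variable names:

```python
var = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

reversed_all = var[::-1]       # whole tuple, back to front
every_other = var[8:1:-2]      # from index 8 down to (not including) index 1
last_three = var[-3:]          # negative start with default end
word = 'string'[::-1]          # strings slice the same way

print(reversed_all)  # (9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
print(every_other)   # (8, 6, 4, 2)
print(last_three)    # (7, 8, 9)
print(word)          # gnirts
```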
Though not a tuple, the range() function produces an immutable sequence of evenly spaced integers whose elements can be retrieved in the same way as a tuple's. It takes the same parameters as slicing, written as range(start, end, interval):
range(0, 10) == range(10)
var = range(4, 20, 2)
len(var) # 8
print(var[0]) # 4
print(var[1]) # 6
print(var[2]) # 8
print(var[3]) # 10
print(var[4:9]) # range(12, 20, 2)
print(var[4:9][3]) # 18
Lists
Both tuples and lists are collections of ordered elements indexed by sequential integers starting from 0, can contain any data type, and use , as a separator. The difference is that lists are mutable and use a set of [] both for creating the list and for retrieving indices:
var = ['string', 42.0, 2 + 2, __name__, False]
len(var) # 5
print(var) # ['string', 42.0, 4, '__main__', False]
Lists also share the same index retrieval logic as tuples, but additionally support index reassignment:
var = [1, 2, 3]
var[1] = 'b'
print(var) # [1, 'b', 3]
# Nested list reassignment
var = [[1, 2, 3], 'a', 'b', 'c']
var[0][2] = 'd' # [[1, 2, 'd'], 'a', 'b', 'c']
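Index ranges can be reassigned altogether. When the interval is left at its default of 1, the assigned value can be any length (shorter than, equal to, or longer than the source range), and the list shrinks or grows to fit: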
var = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
var[2:5] # [2, 3, 4]
len(var[2:5]) # 3
var[2:5] = ['a', 'b', 'c', 'd']
print(var) # [0, 1, 'a', 'b', 'c', 'd', 5, 6, 7, 8, 9]
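With the default interval, range reassignment can also shrink a list, insert into it without replacing anything, or delete a range. A short sketch, where shrunk and inserted are illustrative names for snapshots of the list:

```python
var = [0, 1, 2, 3, 4, 5]

var[1:4] = ['x']          # 3 elements replaced by 1
shrunk = list(var)        # copy taken for comparison

var[1:1] = ['a', 'b']     # empty range: inserts at index 1
inserted = list(var)

var[1:4] = []             # empty value: deletes indices 1 to 3

print(shrunk)    # [0, 'x', 4, 5]
print(inserted)  # [0, 'a', 'b', 'x', 4, 5]
print(var)       # [0, 4, 5]
```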
When reassigning a range with an interval other than 1, the number of elements in the source range and in the new value must match:
var = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
var[0:7:2] # [0, 2, 4, 6]
len(var[0:7:2]) # 4
var[0:7:2] = ['a', 'b', 'c', 'd']
print(var) # ['a', 1, 'b', 3, 'c', 5, 'd', 7, 8, 9]
# Both of these raise a ValueError because the lengths do not match
var[0:7:2] = ['a', 'b', 'c']
var[0:7:2] = ['a', 'b', 'c', 'd', 'e']
Python also provides methods (and the del statement) for manipulating lists:
- append()
  - Equivalent to list[len(list):] = [value]
  - Accepts only 1 input
  - Input does not have to be a list
  - Adds input as a single new element at the end of the list
- count()
  - Accepts only 1 input
  - Returns the number of times the input appears in the list
- extend()
  - Equivalent to list[len(list):] = [val1, val2]
  - Input can be any iterable, such as a list, tuple, or string (a string is added one character at a time)
  - Input cannot be a bare integer/float
  - Appends each item of the input as an individual element of the list
- insert()
  - Inserts an element at a specified index position
- reverse()
  - Inverts order of elements
  - Modifies the original list in place
- sort()
  - Organises strings by character code, so uppercase letters (A to Z) come before lowercase letters (a to z)
  - Organises numbers from least to greatest; equal floats and integers keep their original relative order
  - Supports descending order with list.sort(reverse=True)
  - Modifies the original list in place
- remove()
  - Removes the first instance of a specified value in the list
- pop()
  - Removes the element at a single specified index
  - Defaults to last index if no input is provided
  - Returns the value of the removed element
- del
  - A statement rather than a method
  - Supports list[start:end:interval] syntax
  - Deletes list variable entirely when range is unspecified
- clear()
  - Equivalent to del list[:]
  - Removes all elements from the list
var = [0, 1, 2]
var.insert(2, 0) # Inserts integer "0" at position 2
print(var) # [0, 1, 0, 2]
var.sort(reverse=True)
print(var) # [2, 1, 0, 0]
var.count(0) # 2
var.remove(0)
print(var) # [2, 1, 0]
var.reverse()
print(var) # [0, 1, 2]
var.append(3)
print(var) # [0, 1, 2, 3]
var.append([4, 5, 6])
print(var) # [0, 1, 2, 3, [4, 5, 6]]
var.extend([7, 8, 9])
print(var) # [0, 1, 2, 3, [4, 5, 6], 7, 8, 9]
del var[3:6]
print(var) # [0, 1, 2, 8, 9]
var.pop(3) # 8
print(var) # [0, 1, 2, 9]
var.pop() # 9
print(var) # [0, 1, 2]
var.clear()
print(var) # []
del var
print(var) # Returns error due to undefined variable
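Two details of sort() are easy to miss: it returns None rather than the sorted list, and by default it places uppercase strings ahead of lowercase ones. The sketch below uses a made-up list of words and passes str.lower as the key to get a case-insensitive ordering:

```python
var = ['banana', 'Cherry', 'apple', 'Apple']

returned = var.sort()            # sorts in place and returns None
default_order = list(var)        # copy of the default ordering

var.sort(key=str.lower)          # compare lowercase versions instead
ignoring_case = list(var)

print(returned)        # None
print(default_order)   # ['Apple', 'Cherry', 'apple', 'banana']
print(ignoring_case)   # ['Apple', 'apple', 'banana', 'Cherry']
```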
Dictionaries
Sometimes called associative arrays, dictionaries store key-value pairs in which each key (the index) retrieves a corresponding value. The primary difference from lists is that lists index values with sequential integers, whereas dictionaries support custom keys. Any hashable value can serve as a key, most commonly integers, floats, and strings (and also tuples of hashable values), while mutable types such as lists cannot:
var = ['a', 'b', 'c']
var[0] # "a"
var[1] # "b"
var[2] # "c"
dictionary = {0: 'a', 1: 'b', 2: 'c'}
dictionary[0] # "a"
dictionary[1] # "b"
dictionary[2] # "c"
# ∴
var[0] == dictionary[0]
var[1] == dictionary[1]
var[2] == dictionary[2]
var = {
'Spongebob': 'Squarepants',
'Patrick': 'Star',
'Squidward': 'Tentacles',
'Eugene': 'Krabs',
'Sandy': 'Cheeks'
}
var['Spongebob'] # "Squarepants"
var['Patrick'] # "Star"
var['Squidward'] # "Tentacles"
var['Eugene'] # "Krabs"
var['Sandy'] # "Cheeks"
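The built-in dict() function offers a way of creating dictionaries with keyword-argument syntax, which resembles variable assignment. Keys written this way must be valid identifiers and are stored as strings: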
# This is equivalent to the previous example
var = dict(
Spongebob = 'Squarepants',
Patrick = 'Star',
Squidward = 'Tentacles',
Eugene = 'Krabs',
Sandy = 'Cheeks'
)
Values can also be retrieved using the get() method, with the added benefit of optionally providing a custom return value for missing keys rather than raising an error:
var = {1: 'a', 2: 'b', 3: 'c'}
var[1] == var.get(1)
var[2] == var.get(2)
var[3] == var.get(3)
var[0] # KeyError
var.get(0) # None
var.get(0, 'x') # "x"
var.get(1, 'x') # "a"
Like list indices, keys cannot be duplicated, though values have no such restriction. A repeated key replaces the earlier entry's value. Note that integers and quoted numbers are not the same key, as the latter is a string; however, a float and an equal integer register as duplicates, with the first key retained and only the value replaced:
var = {
1: 'a',
'1': 'a',
1.0: 'c',
}
print(var) # {1: 'c', '1': 'a'}
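What qualifies as a key depends on whether the value is hashable. A small sketch with invented coordinate data, showing a tuple working as a key and a list being rejected:

```python
# Tuples of hashable values are themselves hashable, so they work as keys
grid = {(0, 0): 'origin', (1, 2): 'point'}
lookup = grid[(1, 2)]

try:
    bad = {[0, 0]: 'origin'}   # lists are mutable and therefore unhashable
    outcome = 'created'
except TypeError:
    outcome = 'TypeError'

print(lookup)   # point
print(outcome)  # TypeError
```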
The pop() method removes a key-value pair from a dictionary by specifying the key, and returns the removed value. Note that unlike with lists, pop() requires a key and does not default to the final entry when no input is provided:
var = {'x': 0, 'y': 1}
var.pop('x') # 0
print(var) # {'y': 1}
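Two related behaviours are worth a quick sketch: pop() accepts an optional second argument to return when the key is missing, and the separate popitem() method removes the most recently inserted pair (in Python 3.7 and later). The variable names are illustrative:

```python
var = {'x': 0, 'y': 1, 'z': 2}

missing = var.pop('w', 'not found')   # no error, returns the fallback
last_pair = var.popitem()             # removes and returns the newest pair

print(missing)    # not found
print(last_pair)  # ('z', 2)
print(var)        # {'x': 0, 'y': 1}
```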
Both keys and values can be reassigned and appended to an existing dictionary:
var = {'a': 1.0}
# Value reassignment
var['a'] = 'value'
print(var) # {'a': 'value'}
# Key reassignment
var['key'] = var.pop('a')
print(var) # {'key': 'value'}
# New key-value pair assignment
var['x'] = 42
print(var) # {'key': 'value', 'x': 42}
Multiple key-value pairs can be simultaneously modified or appended using the update() method:
var = {1: 'a', 2: 'b', 3: 'c'}
var.update({1: 'x', 2: 'y'})
print(var) # {1: 'x', 2: 'y', 3: 'c'}
var.update({4: 'b', 5: 'a'})
print(var) # {1: 'x', 2: 'y', 3: 'c', 4: 'b', 5: 'a'}
The values() method retrieves all values from a dictionary:
var = {1: 'a', 2: 'b', 3: 'c'}
var.values() # dict_values(['a', 'b', 'c'])
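The companion methods keys() and items() work the same way for keys and for key-value pairs. A short sketch that also uses items() in a loop to build a reversed lookup, with swapped as an illustrative name:

```python
var = {1: 'a', 2: 'b', 3: 'c'}

keys = list(var.keys())      # all keys
pairs = list(var.items())    # (key, value) tuples

swapped = {}
for key, value in var.items():
    swapped[value] = key     # value becomes the key and vice versa

print(keys)     # [1, 2, 3]
print(pairs)    # [(1, 'a'), (2, 'b'), (3, 'c')]
print(swapped)  # {'a': 1, 'b': 2, 'c': 3}
```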
A retrieved value can itself be indexed or sliced with the same syntax as a nested tuple or list:
var = {'key': 'value'}
var['key'][:3] # "val"
Sets
Sets are unordered collections of unique elements, which can be heterogeneous but must be hashable (for example integers, floats, and strings). Like dictionary keys, duplicate values are discarded:
var = {'y', 2.0, 2, 'x', 1}
print(var) # {'x', 1, 2.0, 'y'} (order may vary)
The unordered nature of sets prevents them from being indexable or having re-assignable values, but they are mutable through methods such as add() and pop():
var = {1, 2, 3}
var.add('x') # Accepts exactly 1 hashable element
print(var) # {'x', 1, 2, 3} (order may vary)
var.pop() # Returns and removes an arbitrary value from the set
Data Type Conversions
Python includes various built-in functions for converting data types to one another. For example, float() converts integers and numeric strings to a floating point number:
float(42) # 42.0
float('42') # 42.0
float('42.0') # 42.0
int() truncates floating point numbers toward zero and parses strings that represent an integer:
int(42.7) # 42
int('42') # 42
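Truncation is not the same as rounding, and int() does not accept a string containing a decimal point. A brief sketch of those edge cases, with illustrative variable names:

```python
negative = int(-42.7)            # truncates toward zero, not downwards
rounded = round(42.7)            # rounding is a separate built-in
via_float = int(float('42.7'))   # decimal strings need float() first

try:
    int('42.7')                  # not a valid integer string
    outcome = 'converted'
except ValueError:
    outcome = 'ValueError'

print(negative)   # -42
print(rounded)    # 43
print(via_float)  # 42
print(outcome)    # ValueError
```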
tuple() places each element of an iterable input into a tuple. Note that a bare integer or float is not iterable, so numbers must be passed inside a list, set, or nested tuple:
tuple('string') # ('s', 't', 'r', 'i', 'n', 'g')
tuple({2, 3.0, 1}) # (1, 2, 3.0)
tuple((42, 3)) # (42, 3)
tuple([42]) # (42,)
tuple(range(4, 20, 4)) # (4, 8, 12, 16)
var = {1: 'a', 2: 'b', 3: 'c'}
tuple(var) # (1, 2, 3)
tuple(var)[1] # 2
tuple(var.values()) # ('a', 'b', 'c')
tuple(var.values())[1] # "b"
list() places each element of an iterable input into a list. Note that a bare integer or float is not iterable, so numbers must be passed inside a tuple, set, or nested list:
list('string') # ['s', 't', 'r', 'i', 'n', 'g']
list({2, 3.0, 1}) # [1, 2, 3.0]
list([42, 3]) # [42, 3]
list((42,)) # [42]
list(range(4, 20, 4)) # [4, 8, 12, 16]
var = {1: 'a', 2: 'b', 3: 'c'}
list(var) # [1, 2, 3]
list(var)[1] # 2
list(var.values()) # ['a', 'b', 'c']
list(var.values())[1] # "b"
set() places each element of an iterable input into a set, discarding duplicates. Elements must be hashable, and a bare number must be passed inside a list, tuple, or nested set:
set('string') # {'n', 's', 'r', 'g', 't', 'i'}
set([1, 2.0, 3]) # {1, 2.0, 3}
set({42, 3}) # {42, 3}
set((42,)) # {42}
set(range(4, 20, 4)) # {4, 8, 12, 16}
var = {1: 'a', 2: 'b', 3: 'c'}
set(var) # {1, 2, 3}
set(var.values()) # {'a', 'b', 'c'}
str() returns the string representation of its input:
2 + 2 # 4
str(2) + str(2) # "22"
str([1, 2])[:3] # "[1,"
str({1: 'a'})[-6:] # ": 'a'}"
join() uses the 'delimiter'.join(input) syntax to concatenate the strings in an iterable input (a single string, or a tuple/list/set of strings), placing the delimiter between each element:
' '.join('string') # 's t r i n g'
var = list('string')
'.'.join(var) # "s.t.r.i.n.g"
var = tuple('string')
'-'.join(var) # "s-t-r-i-n-g"
var = set('string')
''.join(var) # 'rgtnsi' (order may vary)
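Since join() only accepts strings, numbers have to be converted first. A minimal sketch, assuming a made-up list of integers and converting each one with str():

```python
numbers = [1, 2, 3]

try:
    ', '.join(numbers)       # integers are rejected
    outcome = 'joined'
except TypeError:
    outcome = 'TypeError'

joined = ', '.join(str(n) for n in numbers)   # convert each element first

print(outcome)  # TypeError
print(joined)   # 1, 2, 3
```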
split() separates a string into a list at each occurrence of a delimiter, which is any run of whitespace by default. An optional second argument (maxsplit) limits the number of splits performed, counting from the left:
var = 'this is a string'
var.split() # ['this', 'is', 'a', 'string']
var.split(' ', 2) # ['this', 'is', 'a string']
var.split(' ', 1) # ['this', 'is a string']
var.split('i') # ['th', 's ', 's a str', 'ng']
var = """
Line 1
Line 2
"""
var.split('\n') # ['', 'Line 1', 'Line 2', '']
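The default delimiter and an explicit ' ' behave differently when spaces repeat, and the related splitlines() method avoids the empty strings seen above. A short sketch with illustrative variable names:

```python
spaced = 'a  b   c'   # two spaces, then three

default = spaced.split()       # runs of whitespace count as one delimiter
explicit = spaced.split(' ')   # every single space is a delimiter

lines = 'Line 1\nLine 2\n'.splitlines()   # no empty trailing entry

print(default)   # ['a', 'b', 'c']
print(explicit)  # ['a', '', 'b', '', '', 'c']
print(lines)     # ['Line 1', 'Line 2']
```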
Not to be confused with list.sort(), the sorted() function returns a new list containing the elements of any iterable in ascending order by default. Note that strings and numbers cannot be sorted together, and custom sort keys are supported through the key argument:
var = 'string'
sorted(var) # ['g', 'i', 'n', 'r', 's', 't']
sorted(var, reverse=True) # ['t', 's', 'r', 'n', 'i', 'g']
var = {'string', 'element', 'text'}
sorted(var) # ['element', 'string', 'text']
sorted(var, key=len) # ['text', 'string', 'element']
sorted((7, 1, 5)) # [1, 5, 7]
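The points above can be sketched together: sorted() leaves its input untouched, mixed strings and numbers raise an error, and a key function can make otherwise incompatible values comparable. The data and variable names below are made up for illustration:

```python
original = [3, 1, 2]
ordered = sorted(original)        # new list; original is unchanged

try:
    sorted([1, 'a'])              # numbers and strings cannot be compared
    outcome = 'sorted'
except TypeError:
    outcome = 'TypeError'

by_value = sorted([2, '10', 1], key=int)   # compare everything as integers

print(original)  # [3, 1, 2]
print(ordered)   # [1, 2, 3]
print(outcome)   # TypeError
print(by_value)  # [1, 2, '10']
```

Note that the key only affects the comparison; the elements in the result keep their original types.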
String Functions
Strings also have a set of descriptively named methods. Like the list-specific ones, they are called on the string or on the variable storing it, but they return a new string rather than modifying the original. Note that swapcase() is the only one of the following whose result depends on the capitalisation of the input:
'String Text'.swapcase() # "sTRING tEXT"
'string text'.upper() # "STRING TEXT"
'STRING TEXT'.lower() # "string text"
'STRING TEXT'.title() # "String Text"
'STRING TEXT'.capitalize() # "String text"
The strip() method removes all leading and trailing whitespace from a string, leaving any whitespace inside the string intact:
var = \
"""
string
text
""" # "\nstring\ntext\n"
var.strip() # "string\ntext"
' text '.strip() # "text"
'\ttext\t'.strip() # "text"
'\ntext\n'.strip() # "text"
# ∴
' text '.strip() == '\ttext\t'.strip() == '\ntext\n'.strip() == 'text'
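strip() has one-sided variants, lstrip() and rstrip(), and all three accept an optional string of characters to remove instead of whitespace. A brief sketch with illustrative variable names:

```python
left = '  text  '.lstrip()        # leading whitespace only
right = '  text  '.rstrip()       # trailing whitespace only
dashes = '--text--'.strip('-')    # remove a specific character
mixed = 'xyxtextyx'.strip('xy')   # remove any of the listed characters

print(repr(left))   # 'text  '
print(repr(right))  # '  text'
print(dashes)       # text
print(mixed)        # text
```

The argument is treated as a set of individual characters rather than as a prefix or suffix, which is why 'xy' also removes the trailing "yx".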
replace() uses the 'string'.replace('pattern', 'replacement', instances) syntax to exchange occurrences of an existing substring for a new one, optionally limited to the first given number of instances:
var = 'abc abc abc'
var.replace(' ', '-') # "abc-abc-abc"
var.replace('ab', '12') # "12c 12c 12c"
var.replace('bc', '', 2) # "a a abc"
Other Methods
This article is intended to be introductory. A plethora of native Python tools exist for regular expressions, manipulating file paths, parsing JSON data, serialising XML files, and more, each of which deserves explanation in its own article. Feel free to research these subjects elsewhere until they've been covered on this website.