Manipulating Strings in Python


Strings are among the most widely used data types in every programming language. Why? Because we humans understand text better than numbers, so just as we use words when writing and talking, we use text in programming too. With computer programs we parse text, analyze text semantics, and do data mining, and most of this data is human-readable text. These are only a few of the areas of computer science where strings and string manipulation are used.

As in many other programming languages (Java, C# and so on), the string data type in Python is immutable. Every operation that appears to modify a string actually creates a new one, so building up strings piece by piece can be slow and inefficient. Luckily, Python is a "batteries included" language and has many helpful string manipulation methods built in. Only a few are presented here; the string methods page of the Python documentation lists all of them, with detailed explanations and good examples.
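Here is a minimal sketch of what immutability means in practice. The variable names are my own illustration, not part of any library:

```python
# Strings cannot be changed in place: item assignment raises a TypeError.
greeting = "hello"
try:
    greeting[0] = "J"
except TypeError as error:
    print("Cannot modify a string in place:", error)

# Instead, build a new string from pieces of the old one.
new_greeting = "J" + greeting[1:]
print(new_greeting)  # will print - Jello

# When assembling many pieces, collecting them in a list and joining once
# avoids creating an intermediate string on every step.
pieces = []
for number in range(5):
    pieces.append(str(number))
digits = "".join(pieces)
print(digits)  # will print - 01234
```

The original string is left untouched in both cases; only new strings are created.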

Basics

In Python, string literals can be written in several ways: enclosed in ' or ", or in """ for multiline strings.

# string declaration
my_str1 = 'simple text'
my_str2 = "simple text 2"
my_multiline_str = """Hey, this 
is a multiline
string in python"""
my_unicode_str = "This is a unicode string containing some special characters like:éáő"

print(my_str1)
print(my_str2)
print(my_multiline_str)
print(my_unicode_str) # in Python 3 every string is Unicode, so no u"..." prefix is needed

In Python 3.x, all strings are Unicode. In Python 2.x, a string is Unicode only if it is created with the u prefix or with the unicode() type:

my_p2_unicode_str = u"This is a Python 2.x unicode string."
my_p2_unicode_str2 = unicode("This is a Python 2.x unicode string.")

Unicode strings can be encoded as UTF-8 (or any other supported encoding) using the encode() method:

my_utf8_str = my_p2_unicode_str.encode("utf-8")
my_ascii_str = my_p2_unicode_str.encode("ascii")
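For Python 3, the same idea can be sketched as below. This is an illustration with my own variable names: encode() turns a str into bytes, decode() goes back, and encoding to ASCII fails when the text contains characters ASCII cannot represent.

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
text = "special characters: éáő"

utf8_bytes = text.encode("utf-8")         # str -> bytes
print(type(utf8_bytes).__name__)          # will print - bytes

round_trip = utf8_bytes.decode("utf-8")   # bytes -> str
print(round_trip == text)                 # will print - True

# ASCII cannot represent é, á or ő, so a strict encode fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print("ASCII encoding failed:", error)

# The errors argument selects a fallback, here replacing each
# unencodable character with "?".
ascii_bytes = text.encode("ascii", errors="replace")
print(ascii_bytes)  # will print - b'special characters: ???'
```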

Common methods for manipulating strings

Concatenating and multiplying strings

String concatenation is very common in everyday code. In Python, strings can be concatenated using the + operator. Keep in mind that because strings are immutable, concatenating two strings always creates a third one: the contents of both are copied into the new string, which becomes the concatenated value. The + operator only works when both operands are strings. If you try to add a string and an integer, Python raises a TypeError (older Python 3 releases word it as Can't convert 'int' object to str implicitly; the exact message varies between versions). To get around this, convert the object to a string first using the str() function.

# concatenation and multiplication of strings
start_str1 = "The quick brown fox jumps..."
start_str2 = "over the lazy dog "
concatenated_string = start_str1 + start_str2
print(concatenated_string) # will print The quick brown fox jumps...over the lazy dog

list_str = str([1,2,3])
print(list_str) # will print - "[1, 2, 3]"
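The TypeError mentioned above, and the str() conversion that avoids it, can be sketched like this (count and message are names I made up for the illustration):

```python
count = 3

try:
    message = "Items: " + count      # str + int is not allowed
except TypeError as error:
    # The exact wording of the message depends on the Python version.
    print("TypeError:", error)

message = "Items: " + str(count)     # explicit conversion works
print(message)  # will print - Items: 3
```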

Sometimes you need to build a string from a list of items joined by some extra text or character; the join() method is a great solution for this. Since the items in the iterable passed to join() have to be strings, it is good practice to convert them first using str().

joined_str = "|".join(str(x) for x in [3,2,1])
print(joined_str) # will print - 3|2|1

In this example I joined the values 3 2 1 using a pipe, not a comma.

Python has another nice feature, the so-called multiplication (repetition) of strings, which is done using the * operator.

duplicate_str = start_str1 * 2 + start_str2 * 2
print(duplicate_str) # will print The quick brown fox jumps...The quick brown fox jumps...over the lazy dog over the lazy dog 
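One place where string multiplication comes in handy is building separators or underlines whose length depends on other text. A small sketch, with illustrative names:

```python
title = "Report"
underline = "=" * len(title)   # repeat "=" once per character of the title
print(title)
print(underline)  # will print - ======

# Multiplying by zero (or a negative number) gives an empty string.
empty = "abc" * 0
print(repr(empty))  # will print - ''
```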

Searching and replacing characters or substrings

Strings have the find() and replace() methods, which help to find a specific character or word in a string and, if needed, replace words or parts of the text. find() returns the index of the first match, and replace() returns a new string:

# string searching
str1 = "This is a sample string which I will use to showcase search methods of python"
print(str1.find("sample")) # will print 10 

str1 = str1.replace("e", "3") 
print(str1) # will print - This is a sampl3 string which I will us3 to showcas3 s3arch m3thods of python
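A few details of find() and replace() are worth a short sketch: find() returns -1 when nothing matches, it accepts an optional start position, and replace() accepts an optional maximum number of replacements. The variable names are mine:

```python
sentence = "This is a sample string"

# find() returns -1 when the substring does not occur.
missing = sentence.find("python")
print(missing)  # will print - -1

# If only a yes/no answer is needed, the in operator reads better.
has_sample = "sample" in sentence
print(has_sample)  # will print - True

# The optional second argument tells find() where to start searching.
first_is = sentence.find("is")                  # inside "This"
second_is = sentence.find("is", first_is + 1)   # the word "is"
print(first_is, second_is)  # will print - 2 5

# The optional third argument of replace() limits the number of replacements.
limited = sentence.replace("s", "$", 2)
print(limited)  # will print - Thi$ i$ a sample string
```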

String formatting

In Python 3.x string formatting has changed; it is more logical and more flexible. Formatting can be done using the format() method or with the % sign (old style) in format strings. Numbers use C-style format specifiers: d for integers, f for floats, and .x in front of the f, where x is the number of decimal places to display:

# formatting numbers
s1 = "{0:d} is a number".format(133)
print(s1) # will print - 133 is a number

s2 = "{0:f} is a number".format(133.987)
print(s2) # will print - 133.987000 is a number

s3 = "{0:.2f} is a number".format(133.987)
print(s3) # will print - 133.99 is a number  --> notice the rounding

In case there are multiple parameters for the format string, these are marked using {0}, {1}, {2}… values, each specifying the index of the parameter from the format method. 

# format with multiple parameters
s4 = "{0} are red, {1} are pink, {2} smell good, but...".format("roses", "violets", "flowers")
print(s4) # will print - roses are red, violets are pink, flowers smell good, but...

In case indexing is inconvenient or there is a dictionary with the values needed for the format string, named parameters can be used too:

# named parameters and dictionary passing
full_name = "John Doe"
current_age = 34
s5 = "{name} is {age} years old.".format(name=full_name, age=current_age)
print(s5) # will print - John Doe is 34 years old.

vals = { "first_name" : "John", "last_name" : "Doe", "age":33 }
s6 = "{first_name} {last_name} is {age} years old.".format(**vals)
print(s6) # will print - John Doe is 33 years old.
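The old-style % formatting mentioned at the start of this section is not shown in the samples above, so here is a small sketch comparing it with format(). The variable names are only for illustration:

```python
name = "John Doe"
age = 34

# Old style: % operator with a tuple of values.
old_style = "%s is %d years old." % (name, age)
# New style: format() method with indexed placeholders.
new_style = "{0} is {1:d} years old.".format(name, age)
print(old_style)               # will print - John Doe is 34 years old.
print(old_style == new_style)  # will print - True

# Precision works the same way in both styles.
price_old = "%.2f" % 133.987
price_new = "{0:.2f}".format(133.987)
print(price_old, price_new)    # will print - 133.99 133.99

# The % operator also accepts a dictionary with named placeholders.
person = {"first_name": "John", "age": 33}
from_dict = "%(first_name)s is %(age)d years old." % person
print(from_dict)               # will print - John is 33 years old.
```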

 

Parsing CSV data using string methods

Here is a code sample which parses text in CSV format and returns the result as a dictionary, where the keys are the column names and the values are lists of the column values from the CSV data.

First I split the text into lines using the split() method, splitting on "\n", the newline character. The first line contains the column headers. After that I take each remaining line and split it on ",". I use the strip() method to remove extra whitespace from the CSV data. You can save the code and execute it using python3 my_saved_file.py

def parse_csv(text):
    """Parses CSV-formatted text and returns a dictionary whose keys are
    the column headers and whose values are lists of the column data."""
    result = {}

    # first split by EOL, ignoring blank lines (e.g. after a trailing newline)
    lines = [line for line in text.split("\n") if line.strip()]
    if len(lines) > 0:
        headers = [item.strip() for item in lines[0].split(",")]
        for header in headers:
            result[header] = []

        # rows
        for line in lines[1:]: # bypass first line, that contains the header
            row_items = [item.strip() for item in line.split(",")]

            # zip pairs each header with its value and stops at the shorter
            # list, so a row with missing values does not raise an IndexError
            for header, item in zip(headers, row_items):
                result[header].append(item)

    return result



if __name__ == "__main__":
    sample_csv = """Column1, Column2, Column3
    1, 2, 3
    a, b, c
    I, V, X
    4, 5, 6
    d, e, f"""
    parsed_data = parse_csv(sample_csv)
    print(parsed_data) # will print {'Column1': ['1', 'a', 'I', '4', 'd'], 'Column2': ['2', 'b', 'V', '5', 'e'], 'Column3': ['3', 'c', 'X', '6', 'f']} (key order is guaranteed from Python 3.7 on)
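Splitting on "," by hand breaks down when a value itself contains a comma inside quotes. As an alternative to the string methods used above, here is a sketch based on the standard library csv module. The function name parse_csv_with_module and the sample data are my own assumptions for this illustration:

```python
import csv
import io


def parse_csv_with_module(text):
    """Builds the same column-oriented dictionary using csv.reader."""
    # skipinitialspace=True ignores the space that follows each comma
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]  # blank lines come back as []
    if not rows:
        return {}

    headers = [cell.strip() for cell in rows[0]]
    result = {header: [] for header in headers}
    for row in rows[1:]:
        for header, cell in zip(headers, row):
            result[header].append(cell.strip())
    return result


quoted_csv = 'name, city\n"Doe, John", Boston\n"Roe, Jane", Denver\n'
print(parse_csv_with_module(quoted_csv))
# will print - {'name': ['Doe, John', 'Roe, Jane'], 'city': ['Boston', 'Denver']}
```

The quoted names keep their commas, which a plain split(",") would cut in two.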

Making statistics of words

Here is another sample, which counts how often each word appears in a text. The split() method is used to split the text into words, and a dictionary is created where the keys are the words from the text and the values are the number of times each word occurs.

def count_words(text):
    """Makes a statistic of the words appearing in the text
       and returns a dictionary where keys are the words and
       values are the number of occurrences of each word."""
    words = text.split()
    result = {}
    for word in words:        
        if word in result:
            result[word] += 1
        else:
            result[word] = 1

    return result

if __name__ == "__main__":
    text = """In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). A string is generally understood as a data type and is often implemented as an array of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding. A string may also denote more general arrays or other sequence (or list) data types and structures. Source [Wikipedia http://en.wikipedia.org/wiki/String_(computer_science)]"""
    print("Text which will be checked: {0}".format(text))
    stat = count_words(text)
    for key in sorted(stat):
        print("[{0}] appeared {1} times.".format(key, stat[key]))
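Note that split() keeps punctuation attached to words, so "characters," and "characters" are counted separately, and "A" differs from "a". A possible variation, sketched here with the standard library collections.Counter class, strips punctuation and lowercases each word first. The function name count_words_normalized is hypothetical:

```python
from collections import Counter
import string


def count_words_normalized(text):
    """Counts words after stripping surrounding punctuation and lowercasing."""
    words = (word.strip(string.punctuation).lower() for word in text.split())
    # Skip tokens that consisted only of punctuation.
    return Counter(word for word in words if word)


stats = count_words_normalized("A string is a sequence. A string may be fixed.")
print(stats["string"])       # will print - 2
print(stats.most_common(2))  # will print - [('a', 3), ('string', 2)]
```

Counter behaves like a dictionary, so the sorted loop from the sample above works on it unchanged.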

String manipulation in Python is a huge topic, and this article presents only bits and pieces. If you are looking for a particular string manipulation method, try searching for it; you will almost certainly find it in the Python standard library or in another library already written by Python users.

Source code can be accessed on GitHub gists:

Published January 6, 2015

Greg Bogdan

Software Engineer, Blogger, Tech Enthusiast

I am a Software Engineer with over 7 years of experience in different domains (ERP, Financial Products and Alerting Systems). My main expertise is .NET, Java, Python and JavaScript. I like technical writing and have good experience in creating tutorials and how-to technical articles. I am passionate about technology and I love what I do and I always intend to 100% fulfill the project which I am ...
