
Python Programming: The csv and json Python Module by Rui Silva

Files are a big part of programming, and we use them for many things. HTML files have to be loaded when serving a web page. Some applications export files in formats that we need to read in other applications, and sometimes we want to be the ones doing the exporting. In this article, we will cover some concepts that help us understand how to use files, along with some more advanced ways of making use of them.


Duck typing is a very common way of typing objects in Python. The name comes from the expression “If it walks like a duck, swims like a duck and quacks like a duck, it is a duck”. In programming languages this means that if an object is not of the type you desire but has the same methods, it can be used in the same way. To understand this concept in more depth, we’ll be using Python’s built-in StringIO object.

StringIO is a file-like object that does not save anything to disk. This is very useful, for example, when you download a file from a web service but don’t need to store it. We can put the data in a StringIO object and it will behave exactly like an actual file (because StringIO has the same methods as file objects). Contrary to file objects, StringIO only keeps the file’s contents in memory and not on disk (making it very fast compared to actual files), with the downside that the contents are temporary (which in some situations is exactly what we need).

When opening a file, you normally provide two arguments: a file path and an opening mode (the most used modes are 'r' and 'w', for reading and writing respectively; if you omit the mode, it defaults to 'r'). With a StringIO we only need to instantiate one without any arguments to get an empty file. If you want to initialise it with content, just pass a string as the first argument. For example, if we want to store the contents of https://google.com/ temporarily in memory to do something with it, we could do:

>>> import requests
>>> from StringIO import StringIO
>>> response = requests.get("https://google.com/")
>>> google_content = StringIO(response.content)

From now on the variable `google_content` will behave like a file and can be passed to any library or package that expects a file. This is all due to duck-typing.
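To see this duck typing in action, here is a minimal sketch (Python 3 syntax, where StringIO lives in the io module): the in-memory buffer answers read(), seek() and readlines() exactly like a file object would.

```python
import io

# io.StringIO (Python 3; the StringIO module in Python 2) quacks like a file:
buf = io.StringIO("first line\nsecond line\n")

print(buf.read())        # read() works just like on a real file
buf.seek(0)              # so does seek()
lines = buf.readlines()  # and readlines()
print(len(lines))        # 2
```

Any function that only calls file methods on its argument will accept this buffer without knowing it never touched the disk.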

Opening and reading from files

Let’s practice opening and reading files. In this section I’ll show some quirks of opening files, such as “universal newline” mode. The first thing we need is a file. We can create a new empty file on disk by doing:

>>> f = open('/home/path/to/file/file.txt', 'w')

The mode 'w' indicates that we are opening the file for writing, and if no file exists with the name and path provided, one will be created. Note that if there is already a file with the same name as the one you are trying to edit, it will be erased. If you want to append information to an existing file, use the 'a' mode. Try it.
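The difference between the two modes can be sketched like this (the path is created under a temporary directory so the example does not touch real files):

```python
import os
import tempfile

# a scratch path so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "file.txt")

f = open(path, "w")   # 'w' creates the file (or truncates an existing one)
f.write("one\n")
f.close()

f = open(path, "a")   # 'a' keeps the existing content and appends to it
f.write("two\n")
f.close()

print(open(path).read())  # both lines survive: "one" then "two"
```

Had the second open used 'w' instead of 'a', the first line would have been erased.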

When you are done reading data from the file, you should close it by calling:

>>> f.close()

This will release the file and free up any system resources used while it was open.

As of Python 2.5, a new statement was introduced to simplify this process: the with statement. This statement clarifies code that would previously use try/finally blocks, so that it can be written in a more pythonic way. Using it, you can open a file, and when you no longer use it, the file will be properly closed and the system resources freed, even if some exceptions are raised along the way. Here’s an example of the proper opening of a file:

with open('workfile', 'r') as f:
    read_data = f.read()

CSV files and csvreader

Files can have many formats. One of the most common is CSV (comma-separated values; you may also see TSV, for tab-separated values). The format of these files is very simple: the first row either holds the comma-separated column headers or starts the data directly. The sample file we use here is a CSV file with a header in the first line, and the rest of the data follows.

Read

To read a CSV file, you need Python’s csv module, so it has to be imported before you can use it (import csv). After that, and with an opened file, you can use csv.reader to create a reader object, which can iterate over all the lines in the CSV file. Take a look at this example:

>>> import csv
>>> with open('csvfile.csv', 'rU') as f:
...     reader = csv.reader(f, delimiter=',', dialect='excel')
...     for row in reader:
...         print row
...
['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type', 'sale_date', 'price', 'latitude', 'longitude']
['3526 HIGH ST', 'SACRAMENTO', '95838', 'CA', '2', '1', '836', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '59222', '38.631913', '-121.434879']
['51 OMAHA CT', 'SACRAMENTO', '95823', 'CA', '3', '1', '1167', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68212', '38.478902', '-121.431028']
['2796 BRANCH ST', 'SACRAMENTO', '95815', 'CA', '2', '1', '796', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68880', '38.618305', '-121.443839']
['2805 JANETTE WAY', 'SACRAMENTO', '95815', 'CA', '2', '1', '852', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '69307', '38.616835', '-121.439146']
...

In this example, you can see that we open the sample file using the with statement, and we use the opened file in the reader function. The reader function receives some useful arguments, as you can see above. The delimiter defines the column separator, in this case a comma. The dialect argument identifies a specific dialect (in this case 'excel') and loads a set of parameters specific to that dialect. You can get the list of all registered dialects using this command:

>>> csv.list_dialects()
['excel-tab', 'excel']

There are a number of extra arguments that you can pass to the reader function, which you can check out on the csv module’s documentation page.

Once you have the row object, you can access each column by index (row[0]), or you can use the row’s iterator to your advantage and traverse the row’s columns in a for loop, for example.
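Both access styles can be sketched in a few lines (Python 3 syntax; the two-column CSV below is a made-up fragment of the sample data, held in a StringIO so the sketch needs no file on disk):

```python
import csv
import io

# a hypothetical two-column CSV held in memory
f = io.StringIO("street,city\n3526 HIGH ST,SACRAMENTO\n")
reader = csv.reader(f)

header = next(reader)  # the first row holds the column names
row = next(reader)

print(row[0])          # access a column by index: 3526 HIGH ST
for value in row:      # or traverse the columns in a for loop
    print(value)
```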

Write

Writing data to a CSV file is fairly similar to reading data. You have a writer instead of a reader, you send the rows to the writer, and you close the file at the end. It’s as simple as that:

>>> import csv
>>> with open('newfile.csv', 'wb') as csvfile:
...     writer = csv.writer(csvfile, delimiter=' ',
...                         quotechar='|', quoting=csv.QUOTE_MINIMAL)
...     writer.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])

Looking at the example, we can see that it’s similar in many aspects to the reader, including the delimiter and other arguments. The delimiter was already explained for the reader. As for the others, the quotechar is a one-character string used to quote fields containing special characters, such as the delimiter or the quotechar itself, or fields which contain newline characters; it defaults to '"'. The quoting argument controls when the quotes are added (or, in the case of the reader, when they should be expected). As mentioned above, more arguments exist and can be used, so you should consider taking a look at the module documentation.
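The effect of quoting=csv.QUOTE_MINIMAL is easy to see in a small sketch (Python 3 syntax, writing into an in-memory StringIO rather than a file): only the field that actually contains the delimiter gets wrapped in the quotechar.

```python
import csv
import io

out = io.StringIO()
writer = csv.writer(out, quotechar='|', quoting=csv.QUOTE_MINIMAL)

# QUOTE_MINIMAL quotes only fields that need it; the second field
# contains the default delimiter (a comma), so it gets |...| quotes:
writer.writerow(['plain', 'has,comma'])

print(out.getvalue())  # plain,|has,comma|
```

With quoting=csv.QUOTE_ALL, both fields would be quoted; with QUOTE_NONE, the writer would raise an error unless an escapechar is given.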

The json module

JSON is a human-readable data format that became popular in web development as an alternative to XML. It is mostly used to transmit data between client and server, but can also be used to store data. Python has a built-in library to parse JSON data into Python data structures:

>>> import json

So, why do we need JSON? There are other ways to store and load data in Python: pickle, for example. Pickle allows the serialization and deserialization of data in Python. As I said in the last sentence, the “in Python” part is very important: this data is only readable by Python, so it is not of much use for integration with other systems. JSON, on the other hand, has gradually become one of the main data interchange formats, mainly in the web environment, but in many other contexts as well.
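To make the contrast concrete, here is a minimal sketch (Python 3 syntax): pickle produces an opaque, Python-only byte string, while json produces plain text that any platform can parse.

```python
import json
import pickle

data = {'one': 1, 'five': [1, 2, 3, 4, 5]}

# pickle produces a Python-specific binary format...
blob = pickle.dumps(data)
print(type(blob))   # bytes, opaque to other languages

# ...while JSON is plain text any platform can parse:
text = json.dumps(data)
print(text)
```

Both round-trip back to the original dictionary (pickle.loads, json.loads); the difference is who else can read the result.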

Generate JSON data from Python

In order to generate a JSON data structure directly from Python, we only need Python’s default json module and the data structure we want to convert:

>>> import json
>>> data = {'three': 3, 'five': [1, 2, 3, 4, 5], 'two': 2, 'one': 1}
>>> json.dumps(data)
'{"one": 1, "five": [1, 2, 3, 4, 5], "three": 3, "two": 2}'

It’s as simple as that! You are using Python after all…
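json.dumps also accepts a couple of optional keyword arguments worth knowing about; a small sketch (Python 3 syntax, though both arguments exist in Python 2 as well):

```python
import json

data = {'three': 3, 'five': [1, 2, 3, 4, 5], 'two': 2, 'one': 1}

# sort_keys gives a stable, deterministic key order:
stable = json.dumps(data, sort_keys=True)
print(stable)  # {"five": [1, 2, 3, 4, 5], "one": 1, "three": 3, "two": 2}

# indent pretty-prints nested structures for human readers:
print(json.dumps({'one': 1}, indent=4))
```

Stable output is handy when diffing files or writing tests against serialized data.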

Parse JSON data with Python

As you are probably guessing right now, reading JSON data into Python is also extremely simple:

>>> import json
>>> json_data = '{"one": 1, "five": [1, 2, 3, 4, 5], "three": 3, "two": 2}'
>>> json.loads(json_data)
{u'five': [1, 2, 3, 4, 5], u'three': 3, u'two': 2, u'one': 1}

As you can see, working with JSON is extremely simple in Python.
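It helps to know how JSON types land in Python. A quick sketch (Python 3 syntax; in Python 2 the strings come back as unicode, which is why the output above shows u'' prefixes):

```python
import json

# json.loads maps JSON types onto native Python types: objects become
# dicts, arrays become lists, and true/false/null become True/False/None.
obj = json.loads('{"n": 1, "x": 1.5, "ok": true, "nothing": null, "seq": [1, 2]}')

print(obj['ok'])       # True
print(obj['nothing'])  # None
print(obj['seq'])      # [1, 2]
```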

Practical exercise

Now let’s try a bigger project. For this example we need some sample data: a file with one record per line. Fortunately there’s one here. As you can see, the file is a CSV file, so we already know how to process one, right?

Read a file with one record per line

Ok, let’s start by reading the file, one record per line, and storing the rows in a list to be processed later:

>>> import csv
>>> data = []
>>> with open('csvfile.csv', 'rU') as f:
...     reader = csv.reader(f, delimiter=',', dialect='excel')
...     for row in reader:
...         data.append(row)
...
>>> data[:10]
[['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type', 'sale_date', 'price', 'latitude', 'longitude'], ['3526 HIGH ST', 'SACRAMENTO', '95838', 'CA', '2', '1', '836', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '59222', '38.631913', '-121.434879'], ['51 OMAHA CT', 'SACRAMENTO', '95823', 'CA', '3', '1', '1167', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68212', '38.478902', '-121.431028'], ['2796 BRANCH ST', 'SACRAMENTO', '95815', 'CA', '2', '1', '796', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68880', '38.618305', '-121.443839'], ..., ['11150 TRINITY RIVER DR Unit 114', 'RANCHO CORDOVA', '95670', 'CA', '2', '2', '941', 'Condo', 'Wed May 21 00:00:00 EDT 2008', '94905', '38.621188', '-121.270555']]
>>>

Now that we have the data in a list, we can process it any way we like. Let’s move on to the next section, where we manipulate each row and gather some data from it.

Manipulate and gather metrics on each sentence

If you had the curiosity to inspect the file contents before processing it, you will have found that the header row holds the column names for the data:

street, city, zip, state, beds, baths, sq__ft, type, sale_date, price, latitude, longitude

Now, let’s separate the transactions by city and by type so that we can find out how many real estate properties of each type exist in each city.

If we think about it for a bit, we have to separate the data by city and, for each one, separate the data by type:

example = {
    'city_1': {
        'type_1': [property1, property2, property3],
        'type_2': [property10, property22, property12],
    },
    'city_2': {
        'type_1': [property5, property7, property8],
    },
}

This is one example of a data structure that can hold our data; you can think of other ways to store it, as long as you can extract the statistical data requested above.
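As a sketch of one alternative (not from the article): collections.defaultdict builds the same two-level structure without any explicit "does this key exist yet?" branching. The two sample rows are taken from the dataset shown earlier; city is column 1 and type is column 7.

```python
from collections import defaultdict

rows = [
    ['3526 HIGH ST', 'SACRAMENTO', '95838', 'CA', '2', '1', '836',
     'Residential', 'Wed May 21 00:00:00 EDT 2008', '59222', '38.631913', '-121.434879'],
    ['8020 WALERGA RD', 'ANTELOPE', '95843', 'CA', '2', '2', '836',
     'Condo', 'Mon May 19 00:00:00 EDT 2008', '115000', '38.71607', '-121.364468'],
]

# missing cities and missing types are created on first access
grouped = defaultdict(lambda: defaultdict(list))
for row in rows:
    grouped[row[1]][row[7]].append(row)

print(len(grouped['SACRAMENTO']['Residential']))  # 1
```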

So let’s see how we can process the data in order to generate this structure:

>>> processed = {}
>>> for row in data[1:]:  # skip the header row
...     city = row[1]
...     type = row[7]
...     if city in processed:
...         pr_city = processed[city]
...         pr_type = pr_city.get(type, [])
...         pr_type.append(row)
...         processed[city][type] = pr_type
...     else:
...         processed[city] = {type: [row]}
...
>>> processed['ANTELOPE']
{'Residential': [['3828 BLACKFOOT WAY', 'ANTELOPE', '95843', 'CA', '3', '2', '1088', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '126640', '38.70974', '-121.37377'], ['5708 RIDGEPOINT DR', 'ANTELOPE', '95843', 'CA', '2', '2', '1043', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '161250', '38.72027', '-121.331555'], ['4844 CLYDEBANK WAY', 'ANTELOPE', '95843', 'CA', '3', '2', '1215', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '182716', '38.714609', '-121.347887'], ..., ['8428 MISTY PASS WAY', 'ANTELOPE', '95843', 'CA', '3', '2', '1517', 'Residential', 'Thu May 15 00:00:00 EDT 2008', '212000', '38.722959', '-121.347115']], 'Condo': [['8020 WALERGA RD', 'ANTELOPE', '95843', 'CA', '2', '2', '836', 'Condo', 'Mon May 19 00:00:00 EDT 2008', '115000', '38.71607', '-121.364468']]}

Now we have the data in the format that we want, but it is still not very readable. Let’s write a function to pretty-print the data in a more human-friendly way:

>>> def pretty_print_data(data):
...     for city in data:
...         print "City: %s" % (city,)
...         for type in data[city]:
...             print "    Type: %s - %d" % (type, len(data[city][type]))
...

Now, let’s try it and see some sample output:

>>> pretty_print_data(processed)
City: ORANGEVALE
    Type: Residential - 11
City: CITRUS HEIGHTS
    Type: Residential - 32
    Type: Condo - 2
    Type: Multi-Family - 1
City: SACRAMENTO
    Type: Residential - 402
    Type: Condo - 27
    Type: Multi-Family - 10
...

Output a file with the metrics obtained

We now have the statistical data. But what can we do with it? Let’s save it in a file, using the JSON format, so that it can be passed to other applications:

>>> import json
>>> with open('statistics.json', 'wb') as f:
...     json_data = json.dumps(processed)
...     f.write(json_data)
...
>>>

And that’s it! Try to read the data from the newly created JSON file, so that you get the hang of it…
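A sketch of that exercise (Python 3 syntax; the statistics.json name matches the example above, but the path and sample dictionary here are made up so the sketch is self-contained):

```python
import json
import os
import tempfile

# write a small made-up sample first, standing in for the real statistics
path = os.path.join(tempfile.mkdtemp(), 'statistics.json')
sample = {'ANTELOPE': {'Condo': 1, 'Residential': 33}}

with open(path, 'w') as f:
    f.write(json.dumps(sample))

# reading it back is just the reverse of the writing example above
with open(path) as f:
    loaded = json.loads(f.read())

print(loaded['ANTELOPE']['Condo'])  # 1
```

json.load(f) and json.dump(obj, f) do the same thing while handling the file reading and writing for you.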


About the Author:

My name is Rui Silva and I’m a Python developer who loves open source. I started working as a freelancer in 2008, while finishing my graduation in Computer Science at Universidade do Minho. After graduating, I started pursuing a master’s degree in the field of parallel computation and mobile and ubiquitous computing, and ended up finishing only the mobile and ubiquitous computing course. In my 3 years of freelancing, I worked mostly with Python, developing Django websites, Drupal websites and some Magento stores. I also had to do some system administration. After that, I started working at Eurotux Informática, S.A., where I develop websites using Plone, Django and Drupal. I’m also an iOS developer and sometimes I perform system administration tasks. Besides my job, I work as a freelancer using mainly Django and other Python frameworks.


© HAKIN9 MEDIA SP. Z O.O. SP. K. 2013