dataIO: Read Data, Save Result

AutoSaver, Auto-format and Save Prediction Results

In some machine learning competitions, such as those on Kaggle, a .csv file is required for result submission, and its format is usually illustrated in an example file. AutoSaver infers the required format from the example and saves your result accordingly.

class Lutil.dataIO.AutoSaver(save_dir="", example_path=None, **default_kwargs)
Parameters:
  • save_dir (str) – Directory where your results will be saved
  • example_path (str) – Optional, path to the example .csv file
  • default_kwargs – Default keyword arguments passed to DataFrame.to_csv()
AutoSaver.save(self, X, filename, memo=None, **kwargs)
Parameters:
  • X (pd.DataFrame, pd.Series or np.ndarray) – The prediction result to be saved
  • filename (str) – Optional, the filename of the result file. Defaults to datetime.datetime.now().strftime(r"%m%d-%H%M%S") + ".csv" if left empty
  • memo (str) – Optional, the memo logged for this result
  • kwargs – Other keyword arguments passed to DataFrame.to_csv()
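
If filename is left empty, the name is generated from the current time as described above. The default expression evaluates like this:

>>> import datetime
>>> datetime.datetime.now().strftime(r"%m%d-%H%M%S") + ".csv"  # e.g. '0726-153059.csv'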

Auto-format Examples

We have provided the simplest example on the welcome page. More formats and data types can be inferred.

For example, if the index in the example starts from zero and there are no headers:

0, 0.1
1, 0.1
2, 0.1

Run:

>>> import numpy as np
>>> from Lutil.dataIO import AutoSaver

>>> result = np.array([0.2, 0.4, 0.1, 0.5])  # typical output of a scikit-learn predictor

>>> ac = AutoSaver(save_dir="somedir", example_path="path/to/example.csv")
>>> ac.save(result, "some_name.csv")

Then in your somedir/some_name.csv:

0, 0.2
1, 0.4
2, 0.1
3, 0.5

Some competitions use a hash-like value as the index. For instance, in your example.csv:

hash, value
aaffc2, 0.1
spf2oa, 0.1
as2nw2, 0.1

Then you should include the index in the X parameter. How you construct X, however, is largely up to you.

>>> import numpy as np
>>> import pandas as pd
>>> from Lutil.dataIO import AutoSaver

>>> index = ["aaffc2", "spf2oa", "as2nw2", "wn2ajn"]
>>> pred = np.array([0.2, 0.4, 0.1, 0.5])

>>> # In any of the following four ways:
>>> result = pd.Series(pred, index=index)
>>> result = pd.DataFrame({
...     "ix": index,
...     "pred": pred
... })
>>> result = pd.DataFrame({"pred":pred}, index=index)
>>> result = np.array([index, pred]).T

>>> ac = AutoSaver(save_dir="somedir", example_path="path/to/example.csv")
>>> ac.save(result, "some_name.csv")

In your somedir/some_name.csv, the result will be saved correctly:

hash,value
aaffc2,0.2
spf2oa,0.4
as2nw2,0.1
wn2ajn,0.5

As long as the object you are saving is a numpy.ndarray, a pd.Series, or a pd.DataFrame, and it “looks like” the final csv file according to the example, auto-format will work.
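
To make “looks like” concrete, here is a minimal sketch, not Lutil's actual implementation, of the kind of shape matching involved: a 1-D result checked against a two-column example gets a generated index, while table-shaped input is used as-is.

>>> import numpy as np
>>> import pandas as pd

>>> def naive_autoformat(X, example_path):
...     """Illustrative only: match X against the example's column count."""
...     n_cols = pd.read_csv(example_path, header=None, nrows=5).shape[1]
...     X = np.asarray(X)
...     if X.ndim == 1 and n_cols == 2:
...         # 1-D predictions: generate a 0..n-1 index to mirror the example
...         return pd.DataFrame({"index": np.arange(len(X)), "value": X})
...     return pd.DataFrame(X)  # already table-shaped: pass through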

Log Memo for the Results

Sometimes you would like a memo, a short description of the results you have saved. You can use the memo parameter of the AutoSaver.save method.

>>> result1 = np.array([0.2, 0.4, 0.1, 0.5])
>>> result2 = np.array([0.2, 0.3, 0.1, 0.6])

>>> ac = AutoSaver(save_dir="somedir", example_path="path/to/example.csv")

>>> ac.save(result1, "result1.csv", memo="Using Random Forest.")
>>> ac.save(result2, "result2.csv", memo="Using XGBoost.")

Then you will find this in your somedir/memo.txt:

result1.csv: Using Random Forest.
result2.csv: Using XGBoost.

All the new memos will be appended to the end of memo.txt.
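
The memo log is plain text. Conceptually (an illustration, not Lutil's exact code), each save with a memo amounts to an append:

>>> with open("somedir/memo.txt", "a") as f:
...     _ = f.write("result2.csv: Using XGBoost.\n")  # one line per saved result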

Using Arbitrary Keyword Arguments

If the format of your example.csv is too complex and AutoSaver fails to infer it, you can instead pass a pandas.Series or pandas.DataFrame to the save method and assign arbitrary arguments for its to_csv method.

Admittedly, this is not much different from calling DataFrame.to_csv directly, except that it gives you access to the memo feature and lets you set the parameters only once while saving multiple results.

For example:

>>> df = pd.DataFrame({
...     "ix": [1, 2, 3],
...     "pred": [0.1, 0.2, 0.3]
... })

>>> ac = AutoSaver(save_dir="somedir", index=False)
>>> ac.save(df, "result1.csv")

This is equivalent to:

>>> df.to_csv("somedir/result1.csv", index=False)

You can also add more arguments when calling save:

>>> ac.save(df, "result2.csv", header=True)

Both the keyword arguments assigned at initialization and those passed to save will be applied, which is equivalent to:

>>> df.to_csv("somedir/result2.csv", index=False, header=True)

When you use arbitrary arguments, you cannot use the example_path feature; the two contradict each other.

DataReader, Raw Data Management

If you want to read the dataset multiple times or across modules, copy-pasting your pd.read_csv() statement gets tedious. DataReader is a dataset manager that lets you set the reading parameters only once and get the dataset anytime afterwards without extra effort.

class Lutil.dataIO.DataReader(train_path=None, test_path=None, val_path=None, _id="default", read_func=None, **read_kwargs)
Parameters:
  • train_path (str) – Optional, path to the train set
  • test_path (str) – Optional, path to the test set
  • val_path (str) – Optional, path to the validation set
  • _id (str) – Optional, identifier for multiple datasets
  • read_func (callable) – Optional, function used for reading data, default pd.read_csv
  • read_kwargs – Other keyword arguments passed to read_func
DataReader.train(self)

Returns the train set.

DataReader.test(self)

Returns the test set.

DataReader.val(self)

Returns the validation set.

Basic Examples

By default, pandas.read_csv is used to read csv datasets, whose paths are assigned when initializing the DataReader object. You can also assign the parameters for read_csv at initialization.

>>> from Lutil.dataIO import DataReader

>>> reader = DataReader("path/to/train.csv",
...                     "path/to/test.csv",
...                     "path/to/val.csv", index_col=1)

>>> train = reader.train()

This is equivalent to:

>>> train = pd.read_csv("path/to/train.csv", index_col=1)

Likewise, you can also call

>>> test = reader.test()
>>> val = reader.val()

which are equivalent to:

>>> test = pd.read_csv("path/to/test.csv", index_col=1)
>>> val = pd.read_csv("path/to/val.csv", index_col=1)

Once you have initialized an instance, you can completely forget about the object and all its parameter configurations. Within the same runtime, even in other files, the following will retrieve the train set as before:

>>> DataReader().train()

It is the same for the test set and the validation set.
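
This works because the reading configuration is cached when the first instance is created. A rough sketch of the idea, using a module-level registry, and explicitly not Lutil's actual implementation:

>>> import pandas as pd

>>> _registry = {}  # shared by every instance in the same runtime

>>> class SketchReader:
...     """Illustrative stand-in for DataReader's caching behavior."""
...     def __init__(self, train_path=None, _id="default", **read_kwargs):
...         if train_path is not None:  # the first call registers the config
...             _registry[_id] = (train_path, read_kwargs)
...         self._id = _id
...     def train(self):  # later calls just look the config up
...         path, kwargs = _registry[self._id]
...         return pd.read_csv(path, **kwargs)

Viewed this way, the _id parameter in the next section is simply the registry key.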

Accessing Multiple Datasets

Most small-scale machine learning tasks have only one dataset, which is the basic usage shown above. If you need to access multiple datasets, however, you can assign the _id parameter. This works across files as well.

>>> DataReader("path/to/train_1.csv", _id="1", index_col=1)
>>> DataReader("path/to/train_2.csv", _id="2", nrows=500)

>>> train_1 = DataReader(_id="1").train()
>>> # Equivalent to
>>> train_1 = pd.read_csv("path/to/train_1.csv", index_col=1)

>>> train_2 = DataReader(_id="2").train()
>>> # Equivalent to
>>> train_2 = pd.read_csv("path/to/train_2.csv", nrows=500)

Using Other Reading Functions

If the data source is not a csv file and you want to read it with another function, you can pass a callable to the read_func parameter.

>>> import pandas as pd
>>> reader = DataReader("path/to/train.json", read_func=pd.read_json)
>>> train = reader.train()

This is equivalent to:

>>> train = pd.read_json("path/to/train.json")

Applying other keyword parameters works the same as before: pass them when initializing the DataReader object and they will be forwarded when read_func is actually called.

As you can see, this only works if the dataset is stored in a single file and read_func takes the path as its first parameter.
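
If your source needs extra handling, wrap it in a callable that takes the path as its first argument. For example, with a hypothetical read_gzip_json helper:

>>> import pandas as pd
>>> from Lutil.dataIO import DataReader

>>> def read_gzip_json(path, **kwargs):
...     # hypothetical helper: any callable taking the path first will do
...     return pd.read_json(path, compression="gzip", **kwargs)

>>> reader = DataReader("path/to/train.json.gz", read_func=read_gzip_json)
>>> train = reader.train()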