astrobase.varclass.rfclass module

Does variable classification using random forests. Two types of classification are supported:

Variable classification using non-periodic features: this is used to perform a binary classification between non-variable and variable. Uses the features in astrobase.varclass.varfeatures and astrobase.varclass.starfeatures.
TODO: Periodic variable classification using periodic features: this is used to perform multi-class classification for periodic variables using the features in astrobase.varclass.periodicfeatures and astrobase.varclass.starfeatures. The classes recognized are listed in PERIODIC_VARCLASSES below and were generated from manual classification run on various HATNet, HATSouth and HATPI fields.

astrobase.varclass.rfclass.collect_nonperiodic_features(featuresdir, magcol, outfile, pklglob='varfeatures-*.pkl', featurestouse=['stetsonj', 'stetsonk', 'amplitude', 'magnitude_ratio', 'linear_fit_slope', 'eta_normal', 'percentile_difference_flux_percentile', 'mad', 'skew', 'kurtosis', 'mag_iqr', 'beyond1std', 'grcolor', 'gicolor', 'ricolor', 'bvcolor', 'jhcolor', 'jkcolor', 'hkcolor', 'gkcolor', 'propermotion'], maxobjects=None, labeldict=None, labeltype='binary')

This collects variability features into arrays for use with the classifier.

Parameters:
featuresdir (str) – This is the directory where all the varfeatures pickles are. Use pklglob to specify the glob to search for. The varfeatures pickles contain objectids, a light curve magcol, and features as dict key-vals. The `astrobase.lcproc.lcvfeatures` module can be used to produce these.
magcol (str) – This is the key in each varfeatures pickle corresponding to the magcol of the light curve the variability features were extracted from.
outfile (str) – This is the filename of the output pickle that will be written containing a dict of all the features extracted into np.arrays.
pklglob (str) – This is the UNIX file glob to use to search for varfeatures pickle files in featuresdir.
featurestouse (list of str) – Each varfeatures pickle can contain any combination of non-periodic, stellar, and periodic features; these must have the same names as elements in the list of strings provided in featurestouse. This tries to get all the features listed in NONPERIODIC_FEATURES_TO_COLLECT by default. If featurestouse is provided as a list, gets only the features listed in this kwarg instead.
maxobjects (int or None) – This controls how many pickles from the featuresdir to process. If None, will process all varfeatures pickles.
labeldict (dict or None) – If this is provided, it must be a dict of the form '<objectid>':<label value>, with one entry for each objectid collected from the varfeatures pickles. This will turn the collected information into a training set for classifiers. Example: to carry out non-periodic variable feature collection of fake LCs prepared by `astrobase.fakelcs.generation`, use the value of the 'isvariable' dict elem from the fakelcs-info.pkl here, like so: labeldict={x:y for x,y in zip(fakelcinfo['objectid'], fakelcinfo['isvariable'])}
labeltype ({'binary', 'classes'}) – This is either 'binary' or 'classes' for binary/multi-class classification respectively.
Returns:	This returns a dict with all of the features collected into np.arrays, ready to use as input to a scikit-learn classifier.
Return type:	dict
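
A rough illustration of the collection step described above, not the module's actual implementation. It assumes each varfeatures pickle is a dict shaped like {'objectid': ..., magcol: {feature: value}}; that layout, the helper name collect_features_sketch, the magcol name 'aep_000', and the output dict keys are all hypothetical and chosen only for this demo. The labeldict is built the way the labeldict parameter description suggests.

```python
import glob
import os
import pickle
import tempfile

import numpy as np


def collect_features_sketch(featuresdir, magcol, featurestouse,
                            pklglob='varfeatures-*.pkl', labeldict=None):
    """Gather per-object features from pickles into arrays (illustrative only)."""
    pklfiles = sorted(glob.glob(os.path.join(featuresdir, pklglob)))
    objectids, rows = [], []
    for pklfile in pklfiles:
        with open(pklfile, 'rb') as infd:
            varf = pickle.load(infd)
        # ASSUMED pickle layout: {'objectid': str, magcol: {feature_name: value}}
        objectids.append(varf['objectid'])
        # missing features become NaN so every row has the same length
        rows.append([varf[magcol].get(feat, np.nan) for feat in featurestouse])

    # hypothetical output keys; the real function's keys may differ
    collected = {
        'objectids': np.array(objectids),
        'feature_names': list(featurestouse),
        'features': np.array(rows, dtype=float),
    }
    if labeldict is not None:
        collected['labels'] = np.array([labeldict[o] for o in objectids])
    return collected


# Write three tiny fake varfeatures pickles into a temporary directory.
featuresdir = tempfile.mkdtemp()
fake_objects = [('obj-a', 0.1, 0.02), ('obj-b', 2.5, 0.30), ('obj-c', 0.2, 0.03)]
for objectid, stetsonj, mad in fake_objects:
    pklpath = os.path.join(featuresdir, 'varfeatures-%s.pkl' % objectid)
    with open(pklpath, 'wb') as outfd:
        pickle.dump({'objectid': objectid,
                     'aep_000': {'stetsonj': stetsonj, 'mad': mad}}, outfd)

# Stand-in for the contents of a fakelcs-info.pkl.
fakelcinfo = {'objectid': ['obj-a', 'obj-b', 'obj-c'], 'isvariable': [0, 1, 0]}
labeldict = {x: y for x, y in zip(fakelcinfo['objectid'],
                                  fakelcinfo['isvariable'])}

collected = collect_features_sketch(featuresdir, 'aep_000',
                                    ['stetsonj', 'mad'], labeldict=labeldict)
```

The resulting features array has one row per object and one column per requested feature, which is the shape scikit-learn classifiers expect.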

astrobase.varclass.rfclass.train_rf_classifier(collected_features, test_fraction=0.25, n_crossval_iterations=20, n_kfolds=5, crossval_scoring_metric='f1', classifier_to_pickle=None, nworkers=-1)

This gets the best RF classifier after running cross-validation.

splits the training set into test/train samples
does KFold stratified cross-validation using RandomizedSearchCV
gets the RandomForestClassifier with the best performance after CV
gets the confusion matrix for the test set

This runs on the dict produced by collect_nonperiodic_features above, or on any dict with a similar structure.

Parameters:
collected_features (dict or str) – This is either the dict produced by a collect_*_features function or the pickle produced by the same.
test_fraction (float) – This sets the fraction of the input set that will be used as the test set after training.
n_crossval_iterations (int) – This sets the number of iterations to use when running the cross-validation.
n_kfolds (int) – This sets the number of K-folds to use on the data when doing a test-train split.
crossval_scoring_metric (str) – This is a string that describes how the cross-validation score is calculated for each iteration. See the URL below for how to specify this parameter: http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter By default, this is tuned for binary classification and uses the F1 scoring metric. Change the crossval_scoring_metric to another metric (probably 'accuracy') for multi-class classification, e.g. for periodic variable classification.
classifier_to_pickle (str) – If this is the name of a pickle file, the trained classifier will be written to that pickle so it can be loaded later and used to classify data.
nworkers (int) – This is the number of parallel workers to use in the RandomForestClassifier. Set to -1 to use all CPUs on your machine.
Returns:	A dict containing the trained classifier, the cross-validation results and score metrics, the input data set, and all input kwargs used.
Return type:	dict
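
The four steps listed for this function can be sketched directly with scikit-learn. This is a minimal illustration under stated assumptions, not the module's own code: the synthetic data from make_classification stands in for real collected features, and the hyperparameter ranges, the small n_iter, and the fixed random seeds are choices made only to keep the demo fast and repeatable.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     train_test_split)

# Synthetic stand-in for the collected feature array and binary labels.
features, labels = make_classification(n_samples=200, n_features=6,
                                       random_state=42)

# 1. split the input set into train and test samples (test_fraction=0.25)
train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42)

# 2. stratified K-fold cross-validation over randomly sampled hyperparameters
#    (the ranges below are ASSUMED for illustration)
search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=1, random_state=42),
    param_distributions={'n_estimators': randint(10, 50),
                         'max_depth': [3, 5, None]},
    n_iter=5,                  # plays the role of n_crossval_iterations
    scoring='f1',              # plays the role of crossval_scoring_metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(train_x, train_y)

# 3. the RandomForestClassifier with the best cross-validation score
best_clf = search.best_estimator_

# 4. confusion matrix on the held-out test set
confmatrix = confusion_matrix(test_y, best_clf.predict(test_x), labels=[0, 1])
```

For a multi-class problem, swapping scoring='f1' for scoring='accuracy' follows the advice given for crossval_scoring_metric.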

astrobase.varclass.rfclass.apply_rf_classifier(classifier, varfeaturesdir, outpickle, maxobjects=None)

This applies an RF classifier trained using train_rf_classifier to varfeatures pickles in varfeaturesdir.

Parameters:
classifier (dict or str) – This is the output dict or pickle created by train_rf_classifier. This will contain a features_name key that will be used to collect the same features used to train the classifier from the varfeatures pickles in varfeaturesdir.
varfeaturesdir (str) – The directory containing the varfeatures pickles for objects that will be classified by the trained classifier.
outpickle (str) – This is a filename for the pickle that will be written containing the result dict from this function.
maxobjects (int) – This sets the number of objects to process in varfeaturesdir.
Returns:	The classification results after running the trained classifier are returned as a dict. This contains predicted labels and their prediction probabilities.
Return type:	dict
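
As a hedged sketch of what "predicted labels and their prediction probabilities" look like, the snippet below trains a small RandomForestClassifier on synthetic data and applies it to unseen rows. The data, the result dict keys, and the step that drops rows with non-finite feature values are assumptions made for this illustration; they are not taken from the module.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 100 labelled training objects, 20 new objects.
features, labels = make_classification(n_samples=120, n_features=6,
                                       random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(features[:100], labels[:100])

new_features = features[100:]

# ASSUMED preprocessing: keep only rows where every feature is finite.
finite_rows = np.all(np.isfinite(new_features), axis=1)
new_features = new_features[finite_rows]

# Hypothetical result dict keys, chosen for this demo.
results = {
    'predicted_labels': clf.predict(new_features),
    'predicted_probabilities': clf.predict_proba(new_features),
}
```

Each row of the probability array sums to 1, with one column per class in the order given by clf.classes_.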

astrobase.varclass.rfclass.plot_training_results(classifier, classlabels, outfile)

This plots the training results from the classifier run on the training set.

plots the confusion matrix
plots the feature importances
FIXME: plot the learning curves too, see: http://scikit-learn.org/stable/modules/learning_curve.html

Parameters:
classifier (dict or str) – This is the output dict or pickle created by train_rf_classifier containing the trained classifier.
classlabels (list of str) – This contains all of the class labels for the current classification problem.
outfile (str) – This is the filename where the plots will be written.
Returns:	The path to the generated plot file.
Return type:	str
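
One way to draw the two plots named above (confusion matrix and feature importances) with matplotlib is sketched here. The figure layout, the synthetic data, the feature names, and the output filename are assumptions for this example and do not reproduce the module's own plotting code.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for a trained classifier and its test set.
features, labels = make_classification(n_samples=200, n_features=6,
                                       random_state=1)
feature_names = ['feature-%d' % i for i in range(6)]  # hypothetical names
classlabels = ['non-variable', 'variable']

clf = RandomForestClassifier(n_estimators=25, random_state=1)
clf.fit(features[:150], labels[:150])
confmatrix = confusion_matrix(labels[150:], clf.predict(features[150:]),
                              labels=[0, 1])

fig, (ax_cm, ax_imp) = plt.subplots(1, 2, figsize=(11, 4))

# left panel: confusion matrix, rows are true classes, columns are predicted
ax_cm.imshow(confmatrix, cmap='Blues')
ax_cm.set_xticks(range(len(classlabels)))
ax_cm.set_xticklabels(classlabels)
ax_cm.set_yticks(range(len(classlabels)))
ax_cm.set_yticklabels(classlabels)
ax_cm.set_xlabel('predicted class')
ax_cm.set_ylabel('true class')

# right panel: feature importances, sorted from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
ax_imp.bar(range(len(order)), clf.feature_importances_[order])
ax_imp.set_xticks(range(len(order)))
ax_imp.set_xticklabels([feature_names[i] for i in order], rotation=45)
ax_imp.set_ylabel('relative importance')

outfile = os.path.join(tempfile.mkdtemp(), 'rf-training-results.png')
fig.tight_layout()
fig.savefig(outfile)
plt.close(fig)
```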