astrobase.varclass.rfclass module

Does variable classification using random forests. Two types of classification are supported:

astrobase.varclass.rfclass.collect_nonperiodic_features(featuresdir, magcol, outfile, pklglob='varfeatures-*.pkl', featurestouse=['stetsonj', 'stetsonk', 'amplitude', 'magnitude_ratio', 'linear_fit_slope', 'eta_normal', 'percentile_difference_flux_percentile', 'mad', 'skew', 'kurtosis', 'mag_iqr', 'beyond1std', 'grcolor', 'gicolor', 'ricolor', 'bvcolor', 'jhcolor', 'jkcolor', 'hkcolor', 'gkcolor', 'propermotion'], maxobjects=None, labeldict=None, labeltype='binary')

This collects variability features into arrays for use with the classifier.

Parameters:
  • featuresdir (str) – This is the directory where all the varfeatures pickles are. Use pklglob to specify the glob to search for. The varfeatures pickles contain objectids, a light curve magcol, and features as dict key-vals. The astrobase.lcproc.lcvfeatures module can be used to produce these.
  • magcol (str) – This is the key in each varfeatures pickle corresponding to the magcol of the light curve the variability features were extracted from.
  • outfile (str) – This is the filename of the output pickle that will be written containing a dict of all the features extracted into np.arrays.
  • pklglob (str) – This is the UNIX file glob to use to search for varfeatures pickle files in featuresdir.
  • featurestouse (list of str) – Each varfeatures pickle can contain any combination of non-periodic, stellar, and periodic features; these must have the same names as the elements of the list of strings provided in featurestouse. By default, this tries to get all the features listed in NONPERIODIC_FEATURES_TO_COLLECT. If featurestouse is provided as a list, only the features listed in this kwarg are collected instead.
  • maxobjects (int or None) – This controls how many pickles from the featuresdir to process. If None, all varfeatures pickles are processed.
  • labeldict (dict or None) –

    If this is provided, it must be a dict with entries of the following key:val form:

    '<objectid>':<label value>
    

    for each objectid collected from the varfeatures pickles. This will turn the collected information into a training set for classifiers.

    Example: to carry out non-periodic variable feature collection of fake LCs prepared by astrobase.fakelcs.generation, use the value of the ‘isvariable’ dict element from the fakelcs-info.pkl here, like so:

    labeldict={x:y for x,y in zip(fakelcinfo['objectid'],
                                  fakelcinfo['isvariable'])}
    
  • labeltype ({'binary', 'classes'}) – This is either ‘binary’ or ‘classes’ for binary/multi-class classification respectively.
Returns:

This returns a dict with all of the features collected into np.arrays, ready to use as input to a scikit-learn classifier.

Return type:

dict
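
As a rough illustration of the collection step described above, the following minimal sketch gathers a chosen set of features from a directory of pickles into parallel lists. It is not the astrobase implementation. The pickle layout (an 'objectid' key plus a features dict stored under the magcol key), the magcol name 'aep_000', the function name collect_features_sketch, and every output key other than features_name are assumptions made for this example. Plain lists stand in for the np.arrays so that the sketch needs only the standard library.

```python
import glob
import os
import pickle
import tempfile


# Hypothetical pickle layout, assumed from the parameter descriptions above:
# {'objectid': <str>, <magcol>: {<feature name>: <value>, ...}}
def collect_features_sketch(featuresdir, magcol, featurestouse,
                            pklglob='varfeatures-*.pkl',
                            maxobjects=None, labeldict=None):
    """Gather the requested features from each pickle into parallel lists."""
    pklfiles = sorted(glob.glob(os.path.join(featuresdir, pklglob)))
    if maxobjects is not None:
        pklfiles = pklfiles[:maxobjects]

    objectids, rows, labels = [], [], []
    for pklfile in pklfiles:
        with open(pklfile, 'rb') as infd:
            varf = pickle.load(infd)
        objectids.append(varf['objectid'])
        # one row per object, one column per requested feature
        rows.append([varf[magcol][feat] for feat in featurestouse])
        # labels are only collected when a labeldict is supplied
        if labeldict is not None:
            labels.append(labeldict[varf['objectid']])

    return {'objectids': objectids,
            'features_name': list(featurestouse),
            'features': rows,
            'labels': labels if labeldict is not None else None}


# Usage with two made-up objects written to a temporary directory.
with tempfile.TemporaryDirectory() as tmpdir:
    fake = {'obj-1': {'stetsonj': 0.1, 'mad': 0.01},
            'obj-2': {'stetsonj': 2.5, 'mad': 0.20}}
    for objectid, feats in fake.items():
        path = os.path.join(tmpdir, 'varfeatures-%s.pkl' % objectid)
        with open(path, 'wb') as outfd:
            pickle.dump({'objectid': objectid, 'aep_000': feats}, outfd)

    collected = collect_features_sketch(
        tmpdir, 'aep_000', ['stetsonj', 'mad'],
        labeldict={'obj-1': 0, 'obj-2': 1},
    )
```

Passing a labeldict is what turns the collected features into a training set: each row of features gets a matching entry in the labels list.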

astrobase.varclass.rfclass.train_rf_classifier(collected_features, test_fraction=0.25, n_crossval_iterations=20, n_kfolds=5, crossval_scoring_metric='f1', classifier_to_pickle=None, nworkers=-1)

This gets the best RF classifier after running cross-validation.

  • splits the training set into test/train samples
  • does KFold stratified cross-validation using RandomizedSearchCV
  • gets the RandomForestClassifier with the best performance after CV
  • gets the confusion matrix for the test set

Runs on the output dict from functions that produce dicts similar to that produced by collect_nonperiodic_features above.

Parameters:
  • collected_features (dict or str) – This is either the dict produced by a collect_*_features function or the pickle produced by the same.
  • test_fraction (float) – This sets the fraction of the input set that will be used as the test set after training.
  • n_crossval_iterations (int) – This sets the number of iterations to use when running the cross-validation.
  • n_kfolds (int) – This sets the number of K-folds to use on the data when doing a test-train split.
  • crossval_scoring_metric (str) –

    This is a string that describes how the cross-validation score is calculated for each iteration. See the URL below for how to specify this parameter:

    http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

    By default, this is tuned for binary classification and uses the F1 scoring metric. Change the crossval_scoring_metric to another metric (probably ‘accuracy’) for multi-class classification, e.g. for periodic variable classification.

  • classifier_to_pickle (str) – If this is a string giving the name of a pickle file to write, the trained classifier is written to that pickle, which can be loaded later and used to classify data.
  • nworkers (int) – This is the number of parallel workers to use in the RandomForestClassifier. Set to -1 to use all CPUs on your machine.
Returns:

Returns a dict containing the trained classifier, the cross-validation results and score metrics, the input data set, and all input kwargs used.

Return type:

dict
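
To make the default 'f1' scoring metric and the test-set confusion matrix more concrete, here is a minimal standard-library sketch of both quantities for binary labels. It is not how scikit-learn or astrobase computes them internally. The function names and the label arrays are made up for this example, and the label 1 is assumed to mean "variable".

```python
def confusion_counts(true_labels, predicted_labels):
    """Return (tn, fp, fn, tp) for binary labels where 1 means 'variable'."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def f1_score_sketch(true_labels, predicted_labels):
    """F1 is the harmonic mean of precision and recall."""
    tn, fp, fn, tp = confusion_counts(true_labels, predicted_labels)
    if tp == 0:
        # no true positives: precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Made-up test-set labels: 8 non-variables and 2 variables.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that labels every object as non-variable.
y_all_zero = [0] * 10
# A classifier that recovers one variable and raises one false alarm.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy_all_zero = (
    sum(t == p for t, p in zip(y_true, y_all_zero)) / len(y_true)
)
f1_all_zero = f1_score_sketch(y_true, y_all_zero)
f1_pred = f1_score_sketch(y_true, y_pred)
```

In this made-up case the all-non-variable classifier has high accuracy but an F1 of zero, which is one common reason F1 is preferred over accuracy when variables are rare in the training set.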

astrobase.varclass.rfclass.apply_rf_classifier(classifier, varfeaturesdir, outpickle, maxobjects=None)

This applies an RF classifier trained using train_rf_classifier to varfeatures pickles in varfeaturesdir.

Parameters:
  • classifier (dict or str) – This is the output dict or pickle created by train_rf_classifier. This will contain a features_name key that will be used to collect the same features used to train the classifier from the varfeatures pickles in varfeaturesdir.
  • varfeaturesdir (str) – The directory containing the varfeatures pickles for objects that will be classified by the trained classifier.
  • outpickle (str) – This is a filename for the pickle that will be written containing the result dict from this function.
  • maxobjects (int) – This sets the number of objects to process in varfeaturesdir.
Returns:

The classification results after running the trained classifier are returned as a dict. This contains predicted labels and their prediction probabilities.

Return type:

dict
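
The classifier dict stores the feature names so that new objects can be presented to the classifier with the same features, in the same column order, as the training set. The sketch below shows one way that alignment could work. It is an illustration only: the function name build_feature_matrix, the explicit magcol argument, the magcol name 'aep_000', and the choice to skip objects that lack a required feature are all assumptions made for this example, not confirmed astrobase behavior.

```python
def build_feature_matrix(varfeature_dicts, magcol, feature_names):
    """Return (objectids, rows) with columns ordered as in feature_names."""
    objectids, rows = [], []
    for varf in varfeature_dicts:
        feats = varf[magcol]
        # assumption: skip objects missing any feature used in training
        if any(name not in feats for name in feature_names):
            continue
        objectids.append(varf['objectid'])
        # extra features are ignored; order follows feature_names
        rows.append([feats[name] for name in feature_names])
    return objectids, rows


# Feature names as they might be stored alongside a trained classifier.
feature_names = ['stetsonj', 'mad']

# Made-up contents of two varfeatures pickles.
new_objects = [
    {'objectid': 'obj-a',
     'aep_000': {'mad': 0.02, 'skew': 1.1, 'stetsonj': 0.3}},
    {'objectid': 'obj-b',
     'aep_000': {'stetsonj': 1.7}},  # 'mad' is missing
]

objectids, rows = build_feature_matrix(new_objects, 'aep_000', feature_names)
```

The resulting rows could then be handed to a trained classifier to obtain one predicted label and one set of class probabilities per object.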

astrobase.varclass.rfclass.plot_training_results(classifier, classlabels, outfile)

This plots the training results from the classifier run on the training set.

Parameters:
  • classifier (dict or str) – This is the output dict or pickle created by train_rf_classifier containing the trained classifier.
  • classlabels (list of str) – This contains all of the class labels for the current classification problem.
  • outfile (str) – This is the filename where the plots will be written.
Returns:

The path to the generated plot file.

Return type:

str